As part of my bachelor's thesis I want to evaluate two semantic parsers on one common dataset.
The two parsers are:
TRANX by Yin and Neubig: https://arxiv.org/abs/1810.02720 and
Seq2Action by Chen et al.: https://arxiv.org/abs/1809.00773
As I am relatively new to the topic, I'm pretty much starting from scratch.
I don't know how to set up the parsers on my computer in order to run them on the ATIS or GeoQuery dataset.
It would be great if someone here could help me with this problem, as I really have no idea yet how to approach it.
Related
I have recently had the need to create an ANTLR language grammar for the purpose of a transpiler (Converting one scripting language to another). It occurs to me that Google Translate does a pretty good job translating natural language. We have all manner of recurrent neural network models, LSTM, and GPT-2 is generating grammatically correct text.
Question: Is there a model sufficient to train on grammar/code example combinations for the purpose of then outputting a new grammar file given an arbitrary example source-code?
I doubt any such model exists.
The main issue is that languages are generated from grammars, and it is next to impossible to convert back, due to the infinite number of parse trees (combinations) available for various source codes.
So in your case, say you train on Python code (1,000 sample programs): the resultant grammar for every training sample will be the same, so the model will always generate the same grammar irrespective of the example source code.
If you use training samples from a number of languages, the model still can't generate the grammar, as the space of possibilities is infinite.
Your example of Google Translate works for real-life translation because small errors are acceptable, but those models don't rely on generating the root grammar for each language. There are some tools that can translate between programming languages, but they don't generate the grammar; they work from a grammar that is already given.
Update
How to learn grammar from code.
After comparing this problem to some NLP concepts, I have a list of issues that may arise and ways to counter them.
Dealing with variable names, coding structures and tokens.
To understand the grammar, we'll have to break the code down to its bare minimum form. This means understanding what each and every term in the code means. Have a look at this example:
The already simple expression is reduced to a parse tree. We can see that the tree breaks down the expression and tags each number as a factor. This is really important for getting rid of the human element of the code (such as variable names) and diving into the actual grammar. In NLP this concept is known as part-of-speech tagging. You'll have to develop your own method to do the tagging, but it's easy given that you know the grammar of the language.
Understanding the relations
For this, you can tokenize the reduced code and train a model suited to the output you are looking for. In case you want to generate code, use an n-gram model or an LSTM like this example. The model will learn the grammar, but extracting it is not a simple task: you'll have to run separate code to try to extract all the possible relations learned by the model.
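The n-gram part of that idea can be sketched quickly. This is a minimal illustration of my own (the tagged token stream and the use of `collections.Counter` are my choices, not from the linked example): it counts which tag follows each pair of tags, which is exactly the kind of local relation an n-gram model or LSTM would learn.

```python
from collections import Counter, defaultdict

# A tiny trigram model over a tagged token stream (the tags are the
# reduced, "human-element-free" form of the code discussed above).
tagged_code = [
    "[int]", "[variable]", "[operator]", "[factor]", "[expr]",
    "[factor]", "[end]",
    "[int]", "[variable]", "[operator]", "[factor]", "[end]",
]

# Count which tag follows each pair of tags.
trigrams = defaultdict(Counter)
for a, b, c in zip(tagged_code, tagged_code[1:], tagged_code[2:]):
    trigrams[(a, b)][c] += 1

# The continuations seen after "[operator] [factor]" hint at grammar rules:
# an operator-factor pair is followed either by more expression or by the
# end of the statement.
print(trigrams[("[operator]", "[factor]")].most_common())
```

Recovering an explicit grammar from such learned counts is the hard part the answer alludes to: the model stores the relations implicitly, and you would have to enumerate them yourself.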
Example
Code snippet
# Sample code
int a = 1 + 2;
cout << a;
Tags
# Sample tags and tokens
int a = 1 + 2 ;
[int] [variable] [operator] [factor] [expr] [factor] [end]
Leaving in the operator, expr and keyword tokens shouldn't matter if there is enough data present, but they will become part of the grammar.
This is a sample to help you understand my idea. You can improve on it by taking a deeper look at the theory of computation and understanding how automata and the different grammar classes work.
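A rough sketch of such a tagger, using the tag names from the example above (the keyword list and the regex are my own simplifications, not a general solution):

```python
import re

KEYWORDS = {"int", "cout"}  # hypothetical per-language keyword list

def tag_tokens(code):
    """Map a line of code to the reduced tag stream shown above."""
    tags = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        if tok in KEYWORDS:
            tags.append(f"[{tok}]")    # keywords keep their identity
        elif tok.isdigit():
            tags.append("[factor]")    # literals become factors
        elif tok == "=":
            tags.append("[operator]")
        elif tok == "+":
            tags.append("[expr]")      # mirrors the tag used in the example
        elif tok == ";":
            tags.append("[end]")
        else:
            tags.append("[variable]")  # identifiers lose their names
    return tags

print(tag_tokens("int a = 1 + 2;"))
# ['[int]', '[variable]', '[operator]', '[factor]', '[expr]', '[factor]', '[end]']
```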
What you're describing is 'just' learning the structure of context-free grammars.
I'm not sure if this approach will actually work for your case, but it's a long-standing problem in NLP: grammar induction for context-free grammars. An example introduction to tackling this problem with statistical learning methods can be found in Charniak's Statistical Language Learning.
Note that what I described is about CFGs in general, but you might want to check induction for LL grammars, because parser generators mostly use these types of grammars.
I know nothing about ANTLR, but there are pretty good examples of translating natural language e.g. into valid SQL requests: http://nlpprogress.com/english/semantic_parsing.html#sql-parsing.
I'm new to text analysis and scikit-learn. I am trying to vectorize tweets using sklearn's TfidfVectorizer class. When I listed the terms using get_feature_names() after vectorizing the tweets, I saw similar words such as 'goal', 'gooooal' or 'goaaaaaal' appear as different terms.
The question is: how can I map such similar but different words onto the single term 'goal', using sklearn feature-extraction techniques (or any other technique), to improve my results?
In short - you can't. This is a very complex problem that goes all the way to whole-language understanding. Think for a moment - can you define exactly what it means to be "similar but different"? If you can't, a computer won't be able to either. What can you do?
You can come up with easy preprocessing rules, such as "remove any repeating letters", which will fix the "goal" problem. (this should not lead to any further problems)
You can use existing databases of synonyms (like wordnet) to "merge" the same meaning to the same tokens (this might cause false positive - you might "merge" words of different meaning due to the lack of context analysis)
You can build some language model and use it to embed your data in a lower-dimensional space, forcing your model to merge similar meanings (using the well-known heuristic that "words that occur in similar contexts have similar meaning"). One such technique is Latent Semantic Analysis, but obviously there are many more.
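The first rule ("remove any repeating letters") is easy to implement. A minimal sketch (the three-or-more threshold is my own choice, to avoid mangling legitimate double letters as in "ball"); the resulting function can be passed to TfidfVectorizer via its preprocessor argument:

```python
import re

def collapse_runs(text):
    """Collapse any run of three or more identical characters to one."""
    return re.sub(r"(.)\1{2,}", r"\1", text.lower())

print(collapse_runs("Gooooal"))    # -> "goal"
print(collapse_runs("goaaaaaal"))  # -> "goal"
print(collapse_runs("ball"))       # -> "ball" (double letters survive)

# Plug it into the vectorizer, e.g.:
#   TfidfVectorizer(preprocessor=collapse_runs)
```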
We are already aware that the parsing time of the Stanford Parser increases with the length of a sentence. I am interested in finding creative ways in which we can prune the sentence such that the parsing time decreases without compromising accuracy. For example, we can replace known noun phrases with one-word nouns. Similarly, can there be other smart ways of guessing a subtree beforehand, say, using POS tag information? We have a huge corpus of unstructured text at our disposal, so we wish to learn some common patterns that can ultimately reduce the parsing time. References to publicly available literature in this regard would also be highly appreciated.
P.S. We already are aware of how to multi-thread using Stanford Parser, so we are not looking for answers from that point of view.
You asked for 'creative' approaches - the Cell Closure pruning method might be worth a look. See the series of publications by Brian Roark, Kristy Hollingshead, and Nathan Bodenstab. Papers: 1 2 3. The basic intuition is:
Each cell in the CYK parse chart 'covers' a certain span (e.g. the first 4 words of the sentence, or words 13-18, etc.)
Some words - particularly in certain contexts - are very unlikely to begin a multi-word syntactic constituent; others are similarly unlikely to end a constituent. For example, the word 'the' almost always precedes a noun phrase, and it's almost inconceivable that it would end a constituent.
If we can train a machine-learned classifier to identify such words with very high precision, we can thereby identify cells which would only participate in parses placing said words in highly improbable syntactic positions. (Note that this classifier might make use of a linear-time POS tagger, or other high-speed preprocessing steps.)
By 'closing' these cells, we can reduce both the asymptotic and average-case complexity considerably - in theory, from cubic complexity all the way to linear; in practice, we can achieve approximately O(n^1.5) without loss of accuracy.
In many cases, this pruning actually increases accuracy slightly vs. an exhaustive search, because the classifier can incorporate information that isn't available to the PCFG. Note that this is a simple, but very effective form of coarse-to-fine pruning, with a single coarse stage (as compared to the 7-stage CTF approach in the Berkeley Parser).
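To make the intuition concrete, here is a toy CYK recognizer with a hand-written closure rule standing in for the machine-learned classifier (the grammar and the rule are illustrative only, not taken from the papers above):

```python
# Toy CYK recognizer with cell-closure pruning. A real system would use a
# trained classifier; here a single hard-coded rule plays that role.

# CNF grammar: binary rules plus a preterminal (POS) lexicon.
BINARY = {("NP", "VP"): "S", ("DT", "NN"): "NP", ("VB", "NP"): "VP"}
LEXICON = {"the": "DT", "dog": "NN", "ate": "VB", "bone": "NN"}

def parses(words, prune=False):
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1].add(LEXICON[w])          # fill single-word cells
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Cell closure: a multi-word constituent essentially never
            # *ends* with "the", so close any cell whose span ends there.
            if prune and words[j - 1] == "the":
                continue
            for k in range(i + 1, j):
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        if (left, right) in BINARY:
                            chart[i][j].add(BINARY[(left, right)])
    return chart[0][n]  # labels spanning the whole sentence

words = "the dog ate the bone".split()
print(parses(words, prune=True))  # still finds the S parse
```

The pruned run skips every cell whose span ends in "the", yet the sentence still parses, because those cells could never have contributed to a correct analysis anyway; that is the whole trick.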
To my knowledge, the Stanford Parser doesn't currently implement this pruning technique; I suspect you'd find it quite effective.
Shameless plug
The BUBS Parser implements this approach, as well as a few other optimizations, and thus achieves throughput of around 2500-5000 words per second, usually with accuracy at least equal to that I've measured with the Stanford Parser. Obviously, if you're using the rest of the Stanford pipeline, the built-in parser is already well integrated and convenient. But if you need improved speed, BUBS might be worth a look, and it does include some example code to aid in embedding the engine in a larger system.
Memoizing Common Substrings
Regarding your thoughts on pre-analyzing known noun phrases or other frequently-observed sequences with consistent structure: I did some evaluation of a similar idea a few years ago (in the context of sharing common substructures across a large corpus, when parsing on a massively parallel architecture). The preliminary results weren't encouraging. In the corpora we looked at, there just weren't enough repeated substrings of substantial length to make it worthwhile. And the aforementioned cell-closure methods usually make those substrings really cheap to parse anyway.
However, if your target domains involved a lot of repetition, you might come to a different conclusion (maybe it would be effective on legal documents with lots of copy-and-paste boilerplate? Or news stories that are repeated from various sources or re-published with edits?)
I'm not sure what's the best algorithm to use for classifying relationships between words. For example, in a sentence such as "The yellow sun" there is a relationship between yellow and sun. The machine learning techniques I have considered so far are Bayesian statistics, rough sets, fuzzy logic, hidden Markov models and artificial neural networks.
Any suggestions please?
thank you :)
It kind of sounds like you're looking for a dependency parser. Such a parser will give you the relationship between any word in a sentence and its semantic or syntactic head.
The MSTParser uses an online max-margin technique known as MIRA to classify the relationships between words. The MaltParser package does the same but uses SVMs to make parsing decisions. Both systems are trainable and provide similar classification and attachment performance, see table 1 here.
Like the user dmcer pointed out, dependency parsers will help you. There is tons of literature on dependency parsing you can read. This book and these lecture notes are good starting points to introduce the conventional methods.
The Link Grammar Parser which is sorta like dependency parsing uses Sleator and Temperley's Link Grammar syntax for producing word-word linkages. You can find more information on the original Link Grammar page and on the more recent Abiword page (Abiword maintains the implementation now).
For an unconventional approach to dependency parsing, you can read this paper that models word-word relationships analogous to subatomic particle interactions in chemistry/physics.
The Stanford Parser does exactly what you want. There's even an online demo. Here's the results for your example.
Your sentence
The yellow sun.
Tagging
The/DT yellow/JJ sun/NN ./.
Parse
(ROOT
(NP (DT The) (JJ yellow) (NN sun) (. .)))
Typed dependencies
det(sun-3, The-1)
amod(sun-3, yellow-2)
Typed dependencies, collapsed
det(sun-3, The-1)
amod(sun-3, yellow-2)
From your question it sounds like you're interested in the typed dependencies.
Well, no one knows what the best algorithm for language processing is, because the problem hasn't been solved. To be able to understand a human language is to create a full AI.
However, there have, of course, been attempts to process natural languages, and these might be good starting points for this sort of thing:
X-Bar Theory
Phrase Structure Rules
Noam Chomsky did a lot of work on natural language processing, so I'd recommend looking up some of his work.
How do recursive ascent parsers work? I have written a recursive descent parser myself but I don't understand LR parsers all that well. What I found on Wikipedia has only added to my confusion.
Another question is why recursive ascent parsers aren't used more than their table-based counterparts. It seems that recursive ascent parsers have greater performance overall.
The classical Dragon Book explains very well how LR parsers work. There is also Parsing Techniques: A Practical Guide, where you can read about them, if I remember correctly. The Wikipedia article (at least the introduction) is not right: LR parsers were created by Donald Knuth, who describes them in his 1965 paper "On the Translation of Languages from Left to Right". If you understand Spanish, there is a complete list of books here, posted by me; not all of those books are in Spanish.
Before trying to understand how they work, you must understand a few concepts such as FIRST sets, FOLLOW sets and lookahead. I also really recommend understanding the concepts behind LL (top-down) parsers before trying to understand LR (bottom-up) parsers.
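As an illustration of the FIRST concept, here is a sketch that computes FIRST sets for a small expression grammar by fixed-point iteration (the grammar is a standard textbook-style example, not from any particular book; "" stands for the empty production epsilon):

```python
# Grammar:  E -> T E'    E' -> '+' T E' | eps    T -> 'n' | '(' E ')'
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],   # [] is the empty (epsilon) production
    "T":  [["n"], ["(", "E", ")"]],
}

def first_sets(grammar):
    """Compute FIRST sets; "" marks that a nonterminal can derive epsilon."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                   # iterate until nothing new is added
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                nullable_prefix = True
                for sym in prod:     # walk the production left to right
                    if sym in grammar:                 # nonterminal
                        add = first[sym] - {""}
                        if not add <= first[nt]:
                            first[nt] |= add
                            changed = True
                        if "" not in first[sym]:
                            nullable_prefix = False
                            break
                    else:                              # terminal
                        if sym not in first[nt]:
                            first[nt].add(sym)
                            changed = True
                        nullable_prefix = False
                        break
                if nullable_prefix and "" not in first[nt]:
                    first[nt].add("")  # whole production can derive epsilon
                    changed = True
    return first

print(first_sets(GRAMMAR))
```

For this grammar, FIRST(E) = FIRST(T) = {'n', '('} and FIRST(E') = {'+', epsilon}; these are exactly the sets an LL or LR construction consults when deciding how to act on the next lookahead token.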
There is a family of LR parsers, notably LR(k), SLR(k) and LALR(k), where k is the number of lookahead tokens they need. Yacc supports LALR(1) parsers, but you can make tweaks, not theory-based, to make it work with more powerful kinds of grammars.
About performance, it depends on the grammar being analyzed. They execute in linear time, but how much space they need depends on how many states you build for the final parser.
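To give a feel for how recursive ascent works, here is a tiny sketch for the grammar E -> E '+' NUM | NUM. Each LR(0) state becomes a function: a shift is a call, and a reduce is a return carrying the rule's left-hand side, its semantic value, and how many more stack frames (states) must be popped before the goto happens. (The encoding is the standard one; the state numbering is my own.)

```python
class RecursiveAscent:
    """Recursive ascent parser for:  E -> E '+' NUM | NUM."""

    def __init__(self, tokens):
        self.toks = list(tokens) + ["$"]  # "$" is end-of-input
        self.i = 0

    def peek(self):
        return self.toks[self.i]

    def shift(self):
        tok = self.toks[self.i]
        self.i += 1
        return tok

    def parse(self):
        lhs, val, pops = self.state0()
        assert lhs == "ACCEPT"
        return val

    # State 0:  E' -> .E    E -> .E '+' NUM    E -> .NUM
    def state0(self):
        lhs, val, pops = self.state2(self.shift())  # shift NUM
        while lhs != "ACCEPT":
            # pops == 0 here: this state performs the goto on E.
            lhs, val, pops = self.state1(val)
        return lhs, val, pops

    # State 1:  E' -> E.    E -> E. '+' NUM
    def state1(self, e_val):
        if self.peek() == "$":
            return "ACCEPT", e_val, 0
        if self.peek() == "+":
            self.shift()
            lhs, val, pops = self.state3(e_val)
            return lhs, val, pops - 1  # reduce passes through this frame
        raise SyntaxError(f"unexpected token {self.peek()!r}")

    # State 2:  E -> NUM .   (reduce; RHS length 1, so 0 further pops)
    def state2(self, tok):
        if not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return "E", int(tok), 0

    # State 3:  E -> E '+' . NUM
    def state3(self, e_val):
        lhs, val, pops = self.state4(e_val, self.shift())
        return lhs, val, pops - 1      # reduce passes through this frame

    # State 4:  E -> E '+' NUM .  (reduce; RHS length 3, so 2 further pops)
    def state4(self, e_val, tok):
        if not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        return "E", e_val + int(tok), 2

print(RecursiveAscent(["1", "+", "2", "+", "3"]).parse())  # -> 6
```

The parse table has dissolved into ordinary control flow, which is why a generated recursive ascent parser can be compiled into direct jumps instead of table lookups; the price is that the state functions are generated code and much harder to read than the grammar they came from.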
I'm personally having a hard time understanding how a function call can be faster -- much less "significantly faster" than a table lookup. And I suspect that even "significantly faster" is insignificant when compared to everything else that a lexer/parser has to do (primarily reading and tokenizing the file). I looked at the Wikipedia page, but didn't follow the references; did the author actually profile a complete lexer/parser?
More interesting to me is the decline of table-driven parsers with respect to recursive descent. I come from a C background, where yacc (or equivalent) was the parser generator of choice. When I moved to Java, I found one table-driven implementation (JavaCup), and several recursive descent implementations (JavaCC, ANTLR).
I suspect that the answer is similar to the answer of "why Java instead of C": speed of execution isn't as important as speed of development. As noted in the Wikipedia article, table-driven parsers are pretty much impossible to understand from code (back when I was using them, I could follow their actions but would never have been able to reconstruct the grammar from the parser). Recursive descent, by comparison, is very intuitive (which is no doubt why it predates table-driven by some 20 years).
The Wikipedia article on recursive ascent parsing references what appears to be the original paper on the topic ("Very Fast LR Parsing"). Skimming that paper cleared a few things up for me. Things I noticed:
The paper talks about generating assembly code. I wonder if you can do the same things they do if you're generating C or Java code instead; see sections 4 and 5, "Error recovery" and "Stack overflow checking". (I'm not trying to FUD their technique -- it might work out fine -- just saying that it's something you might want to look into before committing.)
They compare their recursive ascent tool to their own table-driven parser. From the description in their results section, it looks like their table-driven parser is "fully interpreted"; it doesn't require any custom generated code. I wonder if there's a middle ground where the overall structure is still table-driven but you generate custom code for certain actions to speed things up.
The paper referenced by the Wikipedia page:
"Very fast LR parsing" (1986)
http://portal.acm.org/citation.cfm?id=13310.13326
Another paper about using code-generation instead of table-interpretation:
"Very fast YACC-compatible parsers (for very little effort)" (1999)
http://www3.interscience.wiley.com/journal/1773/abstract
Also, note that recursive-descent parsing is not the fastest way to parse LL-grammar-based languages:
Difference between an LL and Recursive Descent parser?