Constituency parse tree: before or after pre-processing?

I use Stanford CoreNLP to get constituency parse trees, and I am wondering whether I should do this before or after pre-processing. In pre-processing I lowercase the text, remove punctuation, remove stopwords (e.g., the, you're, ...), remove numbers, keep only alphabetic characters, and so on.
My task is to get a vector representation for each constituency parse tree by treating each leaf (i.e., token) as a vector embedding.
How big a difference does it make if I get the constituency parse tree after pre-processing?

I would run the full pipeline without your custom pre-processing. The parser is trained on data that has not had that kind of pre-processing applied to it, so lowercasing and stripping punctuation and stopwords pushes the input away from what the parser expects and will degrade the trees.
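For example, here is a minimal sketch using NLTK's CoreNLP client (it assumes a CoreNLP server is already running on localhost:9000):

```python
# A minimal sketch: parse the raw, unprocessed sentence with a running
# CoreNLP server (start one with, e.g.:
#   java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000)
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')

# Feed the original sentence, casing and punctuation intact.
tree = next(parser.raw_parse("You're not the only one, are you?"))
tree.pretty_print()

# The leaves are the tokens you can later map to embeddings.
print(tree.leaves())
```

You can still lowercase or filter the leaves afterwards, when you look up embeddings; the parser just should not see the mangled input.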

Related

How can we use dependency parser output for text embeddings or feature extraction from text?

Knowing the dependencies between the various parts of a sentence can add information to what we already get from the raw text. The question is how we can use this to get a good feature representation that can be fed into a classifier such as logistic regression or an SVM, just as TfidfVectorizer gives us a vector representation for text documents. What methods are there for getting this kind of representation from the output of a dependency parser?
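One common recipe is to turn each dependency arc into a pseudo-token and hand those to a bag-of-features vectorizer, analogous to what TfidfVectorizer does with plain words. A hedged sketch with spaCy and scikit-learn (spaCy's parser is swapped in here purely for illustration; any dependency parser's output works the same way):

```python
# A sketch: dependency triples as bag-of-features input for a linear
# classifier. spaCy stands in for any dependency parser here.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

def dep_features(text):
    # One pseudo-token per arc: "head_lemma<-relation-child_lemma"
    doc = nlp(text)
    return " ".join(f"{tok.head.lemma_}<-{tok.dep_}-{tok.lemma_}"
                    for tok in doc if tok.dep_ != "ROOT")

docs = ["I don't like the car", "She likes the car a lot"]
vec = TfidfVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(dep_features(d) for d in docs)
print(vec.get_feature_names_out())
# X can now be fed to LogisticRegression, an SVM, etc.
```

More structured options include averaging word embeddings per dependency relation, or tree-structured networks like the Tree-LSTM mentioned further below.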

Stanford parser output doesn't match demo output

If I use the Stanford CoreNLP neural network dependency parser with the english_SD model, which performs pretty well according to the website (link, bottom of the page), it produces completely different results from this demo, which I assume is based on the LexicalizedParser (or at least on some other parser).
If I put the sentence I don't like the car into the demo page, this is the result: [demo parse output]
If I put the same sentence into the neural network parser, it results in this: [neural network parser output]
In the result of the neural network parser, everything depends directly on like. I think it could be due to the different POS tags, but I used the CoreNLP Maxent Tagger with the english-bidirectional-distsim.tagger model, which I thought was fairly standard. Any ideas on this?
By default, we use the english-left3words-distsim.tagger model for the tagger, which is faster than the bidirectional model but occasionally produces worse results. As both the constituency parser used on the demo page and the neural network dependency parser you used rely heavily on POS tags, it is not really surprising that the different POS sequences lead to different parses, especially when the main verb gets a function word tag (IN = preposition) instead of a content word tag (VB = verb, base form).
But also note that the demo outputs dependency parses in the new Universal Dependencies representation, while the english_SD model parses sentences to the old Stanford Dependencies representation. For your example sentence the correct parses happen to be the same, but you will see differences for other sentences, especially ones with prepositional phrases, which are treated differently in the new representation.
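If you want the bidirectional tagger server-side, you can override the POS model per request. A sketch against the CoreNLP HTTP server (the pos.model path below matches recent English model jars and is an assumption; adjust it to your CoreNLP version):

```python
# A sketch: override the POS model when querying a CoreNLP server,
# then read off the dependency arcs. The model path may differ
# between CoreNLP releases.
import json
import requests

props = {
    "annotators": "tokenize,ssplit,pos,depparse",
    "pos.model": "edu/stanford/nlp/models/pos-tagger/"
                 "english-bidirectional-distsim.tagger",
    "outputFormat": "json",
}
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="I don't like the car".encode("utf-8"),
)
for dep in resp.json()["sentences"][0]["basicDependencies"]:
    print(dep["dep"], dep["governorGloss"], "->", dep["dependentGloss"])
```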

How can a tree be encoded as input to a neural network?

I have a tree, specifically a parse tree with tags at the nodes and strings/words at the leaves. I want to pass this tree as input to a neural network while preserving its structure.
Current approach
Assume we have some dictionary of words w1, w2, ..., wn.
Encode the words that appear in the parse tree as n-dimensional binary vectors, with a 1 in the ith spot whenever the word in the parse tree is wi.
Now what about the tree structure? The number of possible trees over n leaf words grows exponentially, so we can't just set a maximum number of input words and brute-force enumerate all trees.
Right now all I can think of is to approximate the tree by the direct parent of each leaf. This can be represented by a binary vector as well, with dimension equal to the number of distinct tags, on the order of ~100 I suppose.
My input is then two-dimensional: the first part is the vector representation of a word, and the second is the vector representation of its parent tag.
Except this loses a lot of the structure of the sentence. Is there a standard/better way of solving this problem?
You need a recursive neural network. Please see this repository for an example implementation: https://github.com/erickrf/treernn
The principle of a recursive (not recurrent) neural network is illustrated in the figure in that repository.
It learns a representation of each leaf, and then goes up through the parents to finally construct the representation of the whole structure.
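A minimal PyTorch sketch of that bottom-up composition (it assumes binarized trees and toy word ids; the linked repository does the same thing with more machinery):

```python
# A minimal recursive (tree-structured) network: leaves get embeddings,
# each internal node combines its two children with a shared layer.
import torch
import torch.nn as nn

class TreeRNN(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.compose = nn.Linear(2 * dim, dim)

    def forward(self, node):
        # node is either a word id (leaf) or a (left, right) pair
        if isinstance(node, int):
            return self.embed(torch.tensor(node))
        left, right = node
        h = torch.cat([self.forward(left), self.forward(right)])
        return torch.tanh(self.compose(h))

# ((w0 w1) (w2 w3)) as nested pairs of toy word ids
model = TreeRNN(vocab_size=10, dim=8)
root_vec = model.forward(((0, 1), (2, 3)))
print(root_vec.shape)  # torch.Size([8]) -- one vector for the whole tree
```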
Encoding the tree structure: think of a recurrent neural network, where you have one chain that can be constructed with a for loop. Here you have a tree instead, so you need some kind of loop with branching; a recursive function call can work, with some Python overhead.
I suggest you build the neural network with a define-by-run framework (like Chainer or PyTorch) to reduce that overhead, because your tree may have to be built differently for each data sample, which requires rebuilding the computation graph.
Read Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, with the original Torch7 implementation here and a PyTorch implementation; you may get some ideas.
For encoding a tag at a node, I think the easiest way is to encode it the same way you encode a word.
For example, a node's data is [word vector][tag vector]. If the node is a leaf, you have a word but may not have a tag (you did not say whether there is a tag at leaf nodes), so the leaf representation is [word vector][zero vector] (or [word vector][tag vector] if leaves do carry tags). An inner node has no word, so it becomes [zero vector][tag vector]. Then inner nodes and leaves have data representations of the same dimension, and you can treat them equally (or not :3)
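A small sketch of that concatenation scheme (the dimensions and the word/tag tables are made up):

```python
# A sketch of the [word vector][tag vector] node representation:
# leaves get a zero tag slot, inner nodes get a zero word slot.
import numpy as np

WORD_DIM, TAG_DIM = 4, 3
word_vecs = {"car": np.random.randn(WORD_DIM)}  # e.g., pre-trained embeddings
tag_vecs = {"NP": np.random.randn(TAG_DIM)}     # learned or one-hot tag vectors

def node_repr(word=None, tag=None):
    w = word_vecs[word] if word else np.zeros(WORD_DIM)
    t = tag_vecs[tag] if tag else np.zeros(TAG_DIM)
    return np.concatenate([w, t])               # always WORD_DIM + TAG_DIM long

print(node_repr(word="car"))  # leaf:       [word | 0]
print(node_repr(tag="NP"))    # inner node: [0 | tag]
```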
Encode each leaf node using (i) the sequence of nodes that connects it to the root node and (ii) the encoding of the leaf node that comes before it.
For (i), use a recurrent network whose input is tags. Feed this RNN the root tag, the second level tag, ..., and finally the parent tag (or their embeddings). Combine this with the leaf itself (the word or its embedding). Now, you have a feature that describes the leaf and its ancestors.
For (ii), also use a recurrent network! Simply start by computing the feature described above for the leftmost leaf and feed it to a second RNN. Keep doing this for each leaf, moving from left to right. At each step, the second RNN will give you a vector that represents the current leaf with its ancestors, plus the leaves that come before it and their ancestors.
Optionally, do (ii) bi-directionally and you will get a leaf feature that incorporates the whole tree!
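A compact PyTorch sketch of (i) and (ii) together (the embedding sizes are arbitrary, and the bidirectional flag implements the optional last step):

```python
# A sketch of the two-RNN scheme: an "ancestor" GRU reads the tag path
# from the root down to each leaf's parent, then a second GRU reads the
# resulting leaf features left to right.
import torch
import torch.nn as nn

TAG_DIM, WORD_DIM, HID = 8, 16, 32
ancestor_rnn = nn.GRU(TAG_DIM, HID, batch_first=True)
leaf_rnn = nn.GRU(HID + WORD_DIM, HID, batch_first=True,
                  bidirectional=True)  # step (ii), done both ways

def leaf_feature(tag_path, word_vec):
    # tag_path: (1, path_len, TAG_DIM) root-to-parent tag embeddings;
    # keep the GRU's final hidden state and attach the word itself.
    _, h = ancestor_rnn(tag_path)
    return torch.cat([h[0, 0], word_vec])

# Three toy leaves with ancestor paths of different lengths
feats = [leaf_feature(torch.randn(1, n, TAG_DIM), torch.randn(WORD_DIM))
         for n in (3, 4, 2)]
out, _ = leaf_rnn(torch.stack(feats).unsqueeze(0))
print(out.shape)  # (1, 3, 2 * HID): one tree-aware vector per leaf
```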

How to parse CFG's with arbitrary numbers of neighbors?

I'm working on a project that is trying to use context-free grammars for parsing images. We are trying to construct trees of image segments, then use machine learning to parse images using these visual grammars.
I have found SVM-CFG, which looks ideal; the trouble is that it is designed for string parsing, where each terminal in the string has at most two neighbors (the words before and after it). In our visual grammar, each segment can be adjacent to an arbitrary number of other segments.
What is the best way to parse these visual grammars? Specifically, can I encode my data to use SVM-CFG? Or am I going to have to write my own Kernel/parsing library?
SVM-CFG is a specific implementation of the cutting plane optimization algorithm used in SVM-struct (described here http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf, Section 4).
At each step, the cutting plane algorithm calls a function to find the highest scoring structured output assignment (in SVM-CFG this is the highest scoring parse).
For one-dimensional strings, SVM-CFG runs a dynamic programming algorithm to find the highest scoring parse in polynomial time.
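For intuition, that dynamic program is a weighted CKY-style chart over spans; a toy sketch for a grammar in Chomsky normal form (the grammar and scores below are made up):

```python
# A toy weighted CKY chart: for every span, keep the best score of each
# nonterminal. Grammar is in Chomsky normal form; scores are illustrative.
from collections import defaultdict

lex = {("the", "DT"): 0.0, ("car", "NN"): 0.0}  # word -> tag scores
rules = {("NP", "DT", "NN"): 1.0}               # A -> B C scores
words = ["the", "car"]
n = len(words)
chart = defaultdict(dict)                       # (i, j) -> {label: best score}

for i, w in enumerate(words):                   # width-1 spans
    for (word, tag), s in lex.items():
        if word == w:
            chart[i, i + 1][tag] = s

for width in range(2, n + 1):                   # wider spans, bottom up
    for i in range(n - width + 1):
        j = i + width
        for k in range(i + 1, j):               # every split point
            for (a, b, c), s in rules.items():
                if b in chart[i, k] and c in chart[k, j]:
                    cand = s + chart[i, k][b] + chart[k, j][c]
                    if cand > chart[i, j].get(a, float("-inf")):
                        chart[i, j][a] = cand

print(chart[0, n])  # best score per label over the whole string
```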
You could extend SVM-struct to return the highest scoring parse for an image, but no polynomial-time algorithm is known for doing this!
Here is a reference for a state-of-the-art technique that parses images: http://www.socher.org/uploads/Main/SocherLinNgManning_ICML2011.pdf. They run into the same problem when finding the highest scoring parse of an image segmentation, so they use a greedy algorithm to find an approximate solution (see Section 4.2). You might be able to incorporate a similar greedy algorithm into SVM-struct.
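A rough sketch of that greedy idea (the score function below is a stand-in; Socher et al. learn it with a recursive network):

```python
# Greedy image parsing sketch: repeatedly merge the best-scoring pair of
# adjacent regions, building the parse tree bottom up.
def greedy_parse(segments, adjacency, score):
    # segments: region ids; adjacency: set of frozenset({a, b}) neighbor
    # pairs; score(t1, t2): merge score (a learned function in practice).
    trees = {s: s for s in segments}
    while adjacency:
        pair = max(adjacency, key=lambda p: score(*[trees[x] for x in p]))
        a, b = tuple(pair)
        trees[pair] = (trees.pop(a), trees.pop(b))  # new internal node
        # Rewire the neighbors of a and b to the merged region.
        adjacency = {frozenset(pair if x in (a, b) else x for x in q)
                     for q in adjacency if q != pair}
    return list(trees.values())  # one tree per connected component

segs = ["sky", "tree", "grass"]
adj = {frozenset({"sky", "tree"}), frozenset({"tree", "grass"})}
print(greedy_parse(segs, adj, score=lambda x, y: 1.0))
```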

Concurrent parsing algorithms

Are there existing parser algorithms (similar to LALR, SLR and LL) that can parse a single input, not just multiple inputs, in parallel?
Edit: Sorry, I wasn't really looking for research papers; more like, "There are compiler-compilers that generate concurrent parsers" or "This compiler for this language parses it in parallel" - real-world examples.
There aren't any well known ones :-}
Much of the reason is that the problem is usually described as parsing a string, presented to the parser one token at a time. That makes the problem sequential by definition, ugh.
One could imagine presenting the array of tokens to some parser all at once, and then having the parser parse substrings at various points across the array, stitching compatible trees for substrings together. The stitching process is likely to be complicated, but might be manageable if driven by an L(AL)R [better, a GLR] parser that swallowed nonterminals left-to-right after most of the parse trees for substrings were produced; think of this as an "accumulator".
[Shades! A quick Google search produces a 1990 Japanese paper on doing parallel GLR with what amounts to parallel Prolog.]
You now have the problem of producing the array of tokens magically in parallel. Now you need a parallel lexer :-}
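A toy illustration of the parallel-lexing half (it naively splits at whitespace so a chunk boundary can't cut a token in half; real token streams with strings and comments need a smarter split or a fix-up pass at the seams):

```python
# A toy parallel lexer: cut the input at safe (whitespace) boundaries,
# lex the chunks concurrently, and concatenate the token streams.
import re
from concurrent.futures import ProcessPoolExecutor

TOKEN = re.compile(r"\d+|[A-Za-z_]\w*|[+\-*/()=]")

def lex_chunk(chunk):
    return TOKEN.findall(chunk)

def parallel_lex(text, nchunks=4):
    # Aim for even cut points, then slide right to the next whitespace.
    cuts = [0]
    for i in range(1, nchunks):
        p = i * len(text) // nchunks
        while p < len(text) and not text[p].isspace():
            p += 1
        cuts.append(p)
    cuts.append(len(text))
    chunks = [text[a:b] for a, b in zip(cuts, cuts[1:])]
    with ProcessPoolExecutor() as pool:
        return [tok for toks in pool.map(lex_chunk, chunks) for tok in toks]

if __name__ == "__main__":
    print(parallel_lex("x = 12 + foo * (y - 3)"))
```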
EDIT June 2013: I finally remembered McKeeman's 1982 paper on parallel LR parsing.
I'm working on a deterministic context-free parsing algorithm with O(N) work complexity, O(log N) time complexity, fine-grained parallelism proportional to the length of the input string, and behavior equivalent to an LR parser. I'll be submitting it for peer review shortly.
The main idea is to process each character in the input stream independently, assume it could match any rule, and then piece together neighboring groups of characters until they unambiguously match a single rule. Once a rule is matched, it is filtered out by the algorithm. After all rules are matched, the tokens are gathered together into a sequence.
There is some complexity involved in handling tokens with wildcards that could partially nest, and a post-processing step is needed to handle these and maintain the worst-case O(log N) time complexity. This step is probably not needed in practice.
