Dependency Parse Tree Matching in python - machine-learning

I am working on the Answer Sentence Selection problem, and I want to compare the dependency trees of two sentences. I am retrieving the dependency trees from spaCy and now I need to compare them. Is there any way or library in Python that I could use?
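For reference, a minimal sketch of pulling the dependency arcs out of spaCy and comparing two sentences by the overlap of their (head, relation, child) triples; the Jaccard-style score is only an illustrative baseline, not an established tree-matching method.

```python
# Sketch only: requires `python -m spacy download en_core_web_sm` first.
import spacy

nlp = spacy.load("en_core_web_sm")

def dep_arcs(sentence):
    """Return the set of (head lemma, dependency label, child lemma) arcs."""
    return {(tok.head.lemma_, tok.dep_, tok.lemma_) for tok in nlp(sentence)}

def tree_overlap(sent_a, sent_b):
    """Jaccard overlap of the two sentences' dependency arcs (a toy baseline)."""
    a, b = dep_arcs(sent_a), dep_arcs(sent_b)
    return len(a & b) / len(a | b) if a | b else 0.0

print(tree_overlap("The cat chased the mouse.", "The mouse was chased by the cat."))
```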

Related

Best way to compare meaning of text documents?

I'm trying to find the best way to compare two text documents using AI and machine learning methods. I've used TF-IDF with cosine similarity and other similarity measures, but these compare the documents at the word (or n-gram) level.
I'm looking for a method that lets me compare the meaning of the documents. What is the best way to do that?
You should start by reading about the word2vec model.
Use gensim and get Google's pretrained model.
To vectorize a whole document, use the Doc2Vec model.
After getting vectors for all your documents, compare them with a distance metric such as cosine or Euclidean distance.
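A minimal sketch of the doc2vec route with gensim; the tiny corpus and hyperparameters are placeholders (the answer above suggests Google's pretrained word2vec model, but this sketch just trains a small Doc2Vec model for illustration).

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

# Toy corpus; in practice, use your own tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the yard".split(),
    "stock prices fell sharply today".split(),
]
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]

# Hyperparameters are placeholders; tune them for a real corpus.
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

v0 = model.infer_vector(docs[0])
v1 = model.infer_vector(docs[1])
print(cosine(v0, v1))
```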
This is very difficult. There is actually no computational definition of "meaning". You should dive into text mining, summarization and libraries like gensim, spacy or pattern.
In my opinion, the libraries with the highest return on investment (ROI), especially if you are a newcomer, are the tools built around chatbots, which try to extract structured data from natural language. That is the closest thing to "meaning" that is readily available. One free-software example is Rasa NLU (natural language understanding).
The drawback of such tools is that they only work in the domain they were trained and prepared for, and in particular they do not aim at comparing documents the way you want.
I'm trying to find the best way to compare two text documents using AI
You must come up with a more precise task and, from there, find out which technique applies best to your use case. Do you want to classify documents into predefined categories? Do you want to compute some similarity between two documents? Given an input document, do you want to find the most similar documents in a database? Do you want to extract the important topics or keywords in the document? Do you want to summarize the document? If so, is it abstractive summarization or key-phrase extraction?
In particular, there is no software that can extract some kind of semantic fingerprint from an arbitrary document. Depending on the end goal, the way to achieve it might be completely different.
You must narrow down the precise goal you are trying to achieve; from there, you will be able to ask another question (or improve this one) that describes your goal precisely.
Text understanding is AI-complete, so just telling the computer "tell me something about these two documents" doesn't work.
As others have said, word2vec and other word embeddings are tools for achieving many goals in NLP, but they are only a means to an end. You must define the input and output of the system you are trying to design before you can start working on the implementation.
There are two other Stack Exchange communities that you might want to dig into:
Linguistics
Data Science
Given the TF-IDF value for each token in your corpus (or the most meaningful ones), you can compute a sparse representation for a document.
This is implemented in scikit-learn's TfidfVectorizer.
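As a concrete illustration, a short scikit-learn sketch of this TF-IDF representation plus cosine similarity (the toy documents are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock prices fell sharply today.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Pairwise cosine similarities between all documents.
print(cosine_similarity(tfidf))
```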
As other users have pointed out, this is not the best solution to your task.
You should consider embeddings instead.
The easiest solution consists in using an embedding at the word level, such as the one provided by the FastText framework.
Then you can create an embedding for the whole document by summing the embeddings of the individual words that compose it.
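A rough sketch of this sum-of-word-embeddings idea using gensim's FastText implementation; in practice you would load pretrained FastText vectors rather than train on a toy corpus.

```python
import numpy as np
from gensim.models import FastText

# Toy training corpus; real use would load pretrained FastText vectors instead.
sentences = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the yard".split(),
]
model = FastText(sentences, vector_size=50, min_count=1, epochs=50)

def doc_vector(tokens):
    """Document embedding as the sum of its word embeddings."""
    return np.sum([model.wv[t] for t in tokens], axis=0)

print(doc_vector("dogs in the yard".split()).shape)  # (50,)
```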
An alternative consists in training an embedding directly at the document level, using a Doc2Vec implementation such as the one in gensim or DL4J.
You can also use LDA or LSI models on your text corpus. These methods (and others such as word2vec and doc2vec) summarize documents as fixed-length vectors that reflect their meaning and the topics they belong to.
Read more:
https://radimrehurek.com/gensim/models/ldamodel.html
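A minimal gensim LDA sketch along these lines: each document comes out as a topic distribution, i.e. a fixed-length vector you can compare (the corpus and num_topics are placeholders).

```python
from gensim import corpora, models

texts = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the yard".split(),
    "stock prices fell sharply today".split(),
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# num_topics is a placeholder; choose it for your corpus.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Topic distribution of the first document: a fixed-length "meaning" vector.
print(lda.get_document_topics(corpus[0]))
```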
I heard there are three approaches from Dr. Golden:
- Cosine Angular Separation
- Hamming Distance
- Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)
These methods are based on semantic similarity.
I also heard that some companies use a tool called spaCy to summarize documents in order to compare them to each other.

How can I plug in my own NER into the Stanford NLP parser pipeline?

I am trying to provide parsing for some biodiversity literature, and I have my own NER tool that I developed to identify species names. I need to plug this into the parser pipeline somehow to enhance the dependency parsing, but I am not sure how to go about it and haven't been able to find anything that indicates how to approach it.
My tagger is just a simple dictionary lookup that runs in Python 3.6 (roughly like the sketch below).
Any ideas?
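For reference, the kind of dictionary lookup described above might look roughly like this; the species list and the span-matching logic are purely illustrative.

```python
# Illustrative dictionary-lookup tagger of the kind described above.
SPECIES = {"puma concolor", "quercus robur", "homo sapiens"}

def tag_species(text):
    """Return (start, end, surface form) spans of text that match the species list."""
    spans = []
    lowered = text.lower()
    for name in SPECIES:
        start = lowered.find(name)
        while start != -1:
            spans.append((start, start + len(name), text[start:start + len(name)]))
            start = lowered.find(name, start + 1)
    return spans

print(tag_species("Sightings of Puma concolor increased near the oak (Quercus robur) stands."))
```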

Customizing the Named Entity Recognition model in Azure ML

Can we customize the Named Entity Recognition (NER) model in Azure ML Studio with a separate training dataset? What I want to do is find non-English names in a text. (The training dataset includes the set of names to be used for training.)
Unfortunately, this module's ability to perform NER with a custom set of entities is planned for the future, but not currently available.
If you're familiar with Python and willing to put in the extra footwork, you might consider using the Natural Language Toolkit (NLTK). Sujit Pal has a nice blog post and sample code describing the creation of a custom NER with that package. You may be able to train an NLTK NER model and apply it to your data of interest from within an Execute Python Script module on Azure ML.
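As a very rough illustration of the NLTK route only (the tiny training set and character features here are placeholder assumptions, and the linked blog post goes into far more detail; in Azure ML you would wrap something like this inside an Execute Python Script module):

```python
import nltk  # nltk.download("punkt") may be needed for word_tokenize

# Tiny illustrative training set: target (non-English) names vs. ordinary tokens.
train = [("Saoirse", True), ("Bjorn", True), ("Anushka", True), ("Priya", True),
         ("table", False), ("running", False), ("window", False), ("report", False)]

def features(token):
    """Very simple character-level features; real features would be richer."""
    return {"suffix2": token[-2:].lower(),
            "prefix2": token[:2].lower(),
            "is_title": token.istitle()}

classifier = nltk.NaiveBayesClassifier.train(
    [(features(tok), label) for tok, label in train])

def extract_names(text):
    return [t for t in nltk.word_tokenize(text) if classifier.classify(features(t))]

print(extract_names("We met Bjorn and Priya at the office."))
```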

How to convert from Stanford Universal Dependencies to Phrase Grammar?

In my application I am using Stanford CoreNLP for parsing english text into a graph data structure (Universal Dependencies).
After some modifications of the graph I need to generate a natural language output for which I am using SimpleNLG: https://github.com/simplenlg/simplenlg
However SimpleNLG is using Phrase Grammar.
Therefore in order to successfully use SimpleNLG for natural language generation I need to convert from Universal Dependencies into Phrase Grammar.
What is the easiest way of achieving this?
So far I have only come across this article on this topic:
http://delivery.acm.org/10.1145/1080000/1072147/p14-xia.pdf?ip=86.52.161.138&id=1072147&acc=OPEN&key=4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E6D218144511F3437&CFID=642131329&CFTOKEN=21335001&acm=1468166339_844b802736ce07dab89064efb7f8ede9
I am hoping that someone might have some more practical code examples to share on this issue?
Phrase-structure trees contain more information than dependency trees and therefore you cannot deterministically convert dependency trees to phrase-structure trees.
But if you are using CoreNLP to parse the sentences, take a look at the parse annotator. Unlike the dependency parser, this parser also outputs phrase-structure trees, so you can use this annotator to directly parse your sentences to phrase-structure trees.
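If you drive CoreNLP from Python, a rough sketch of querying a locally running CoreNLP server for phrase-structure trees might look like this; the port, properties, and JSON layout are assumptions based on the default server setup.

```python
import json
import requests

# Assumes a CoreNLP server is already running locally, e.g.:
#   java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
props = {"annotators": "tokenize,ssplit,pos,parse", "outputFormat": "json"}
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="The quick brown fox jumps over the lazy dog.".encode("utf-8"),
)
for sentence in resp.json()["sentences"]:
    # With the parse annotator, each sentence carries a bracketed constituency tree.
    print(sentence["parse"])
```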

Naive Bayesian for Topic detection using "Bag of Words" approach

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a naive Bayesian approach that I might be able to look up for this?
Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrence of words other than the ones already mapped, I want to add them to the mappings, thereby improving the dictionary and learning new words that map to a topic, and also adjusting the probabilities of the words.
How should I go about doing this? Is my approach the right one?
Which programming language would be best suited for the implementation?
Existing Implementations of Naive Bayes
You would probably be better off just using one of the existing packages that supports document classification using naive Bayes, e.g.:
Python - To do this using the Python-based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book; a short sketch follows this list.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
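To make the Python option above concrete, here is a minimal NLTK bag-of-words Naive Bayes sketch (the toy documents and labels are placeholders):

```python
import nltk

# Toy labeled corpus; real data would be much larger.
train_docs = [
    ("the striker scored a late goal", "sports"),
    ("the team won the championship match", "sports"),
    ("the senate passed the new budget bill", "politics"),
    ("the president vetoed the legislation", "politics"),
]

def bag_of_words(text):
    """Feature dict marking which words are present."""
    return {word: True for word in text.split()}

train_set = [(bag_of_words(text), label) for text, label in train_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(bag_of_words("a goal in the final match")))
print(classifier.classify(bag_of_words("a vote on the budget")))
```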
Bootstrapping Classification from Keywords
It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.
This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% with a bootstrapped Naive Bayes classifier. For their data, the latter is close to the human level of agreement, as people agreed with each other about document labels 72% of the time.
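A compressed sketch of the bootstrapping idea, with hand-picked seed keywords and a single self-training pass; the keywords, corpus, and single relabeling step are illustrative, and the paper itself uses EM rather than this one-shot scheme.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Seed keywords per topic (hand-picked placeholders).
seeds = {"sports": {"goal", "match", "team"}, "politics": {"senate", "vote", "bill"}}

docs = [
    "the team scored a goal in the match",
    "the senate held a vote on the bill",
    "fans celebrated the championship win",
    "lawmakers debated the new legislation",
]

# Step 1: label documents that contain a seed keyword.
def keyword_label(doc):
    for topic, words in seeds.items():
        if words & set(doc.split()):
            return topic
    return None

labels = [keyword_label(d) for d in docs]
seeded = [(d, l) for d, l in zip(docs, labels) if l is not None]

# Step 2: train Naive Bayes on the keyword-labeled subset.
vec = CountVectorizer()
X = vec.fit_transform([d for d, _ in seeded])
clf = MultinomialNB().fit(X, [l for _, l in seeded])

# Step 3: let the classifier label the remaining documents.
unlabeled = [d for d, l in zip(docs, labels) if l is None]
print(dict(zip(unlabeled, clf.predict(vec.transform(unlabeled)))))
```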

Resources