I am following approach of Distance supervision in article Distant Supervision for Relation Extraction using Ontology Class Hierarchy-Based Features.
I have already tokenized sentence for example:
Her most famous temple, the Parthenon, on the Acropolis in Athens takes its name from that title
and I also have lexical features from this sentence as you can see in table:
The question is how to create feature vector from this table, which can be passed to Logistic regression? Or is there any other classification method, which should be used?
A posible approach could be to use embeddings(for example word2vect). Example:
Embeddings article from tensorflow
Related
I'm working on finding similarities between short sentences and articles. I used many existing methods such as tf-idf, word2vec etc but the results are just okay. The most relevant measure which I found was word moving distance, however, its results are not that better than the other measures. I know it's a challenging problem, however, I am wondering if there are any new methods to find an approximate similarity more on a higher or concept level than just matching words. Especially, any alternative new methods like word moving distance which looks at slightly higher semantic of a sentence or article?
This is the most recent basing on a paper published 4 months ago.
Step 1:
Load the suitable model using gensim and calculate the word vectors for words in the sentence and store them as a word list
Step 2 : Computing the sentence vector
The calculation of semantic similarity between sentences was difficult before but recently a paper named "A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE EMBEDDINGS" was proposed which suggests a simple approach by computing the weighted average of word vectors in the sentence and then remove the projections of the average vectors on their first principal component.Here the weight of a word w is a/(a + p(w)) with a being a parameter and p(w) the (estimated) word frequency called smooth inverse frequency.this method performing significantly better.
A simple code to calculate the sentence vector using SIF(smooth inverse frequency) the method proposed in the paper has been given here
Step 3: using sklearn cosine_similarity load two vectors for the sentences and compute the similarity.
This is the most simple and efficient method to compute the semantic similarity of sentences.
Obviously, this is a huge and busy research area, but I'd say there are two broad types of approaches you could look into:
First, there are some methods that learn sentence embeddings in an unsupervised manner, such as Le and Mikolov's (2014) Paragraph Vectors, which are implemented in gensim, or Kiros et al.'s (2015) SkipThought vectors, with an implementation on Github.
Then there also exist supervised methods that learn sentence embeddings from labelled data. The most recent one is Conneau et al.'s (2017), which trains sentence embeddings on the Stanford Natural Language Inference dataset, and shows these embeddings can be used successfully across a range of NLP tasks. The code is available on Github.
You might also find some inspiration in a blog post I wrote earlier this year on the topic of embeddings.
To be honest the best thing I know to use for this at the moment is AMR:
About AMR here: https://amr.isi.edu/
Documentation here: https://github.com/amrisi/amr-guidelines/blob/master/amr.md
You can use a system like JAMR (see here: https://github.com/jflanigan/jamr) to generate AMRs for your sentence and then you can use Smatch (see here: https://amr.isi.edu/eval/smatch/tutorial.html) to compare the similarity of the two generated AMRs.
What you are trying to do is very difficult and is an active ongoing area of research.
You can use semantic similarity with WordNet for each pair of nouns.
To have a quick look you can enter bird-noun-1 and chair-noun-1 and select wordnet at http://labs.fc.ul.pt/dishin/ it gives you:
Resnik 0.315625756544
Lin 0.0574161071905
Jiang&Conrath 0.0964964414156
The Python code is at: https://github.com/lasigeBioTM/DiShIn
I am working on Named Entity Recognition. I evaluated libraries, such as MITIE, Stanford NER , NLTK NER etc., which are built upon conventional nlp techniques. I also looked at deep learning models such as word2vec and Glove vectors for representing words in vector space, they are interesting since they provide the information about the context of a word, but specifically for the task of NER, I think its not well suited. Since all these vector models create a vocab and corresponding vector representation. If any word failed to be in the vocabulary it will not be recognised. Assuming that it is highly likely that a named entity is not present since they are not bound by the language. It can be anything. So if any deep learning technique have to be useful in such cases are the ones which are more dependent on the structure of the sentence by using standard english vocab i.e. ignoring named fields. Is there any such model or method available? Will CNN or RNN may be the answer for it ?
I think you mean texts of a certain language, but the named entities in that text may contain different names (e.g. from other languages)?
The first thing that comes to my mind is some semi-supervised learning techniques that the model is being updated periodically to reflect new vocabulary.
For example, you may want to use word2vec model to train the incoming data, and compare the word vector of possible NEs with existing NEs. Their cosine distance should be close.
I am new to Machine Learning.I am working on a project where the machine learning concept need to be applied.
Problem Statement:
I have large number(say 3000)key words.These need to be classified into seven fixed categories.Each category is having training data(sample keywords).I need to come with a algorithm, when a new keyword is passed to that,it should predict to which category this key word belongs to.
I am not aware of which text classification technique need to applied for this.do we have any tools that can be used.
Please help.
Thanks in advance.
This comes under linear classification. You can use naive-bayes classifier for this. Most of the ml frameworks will have an implementation for naive-bayes. ex: mahout
Yes, I would also suggest to use Naive Bayes, which is more or less the baseline classification algorithm here. On the other hand, there are obviously many other algorithms. Random forests and Support Vector Machines come to mind. See http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/ If you use a standard toolkit, such as Weka, Rapidminer, etc. these algorithms should be available. There is also OpenNLP for Java, which comes with a maximum entropy classifier.
You could use the Word2Vec Word Cosine distance between descriptions of each your category and keywords in the dataset and then simple match each keyword to a category with the closest distance
Alternatively, you could create a training dataset from already matched to category, keywords and use any ML classifier, for example, based on artificial neural networks by using vectors of keywords Cosine distances to each category as an input to your model. But it could require a big quantity of data for training to reach good accuracy. For example, the MNIST dataset contains 70000 of the samples and it allowed me reach 99,62% model's cross validation accuracy with a simple CNN, for another dataset with only 2000 samples I was able reached only about 90% accuracy
There are many classification algorithms. Your example looks to be a text classification problems - some good classifiers to try out would be SVM and naive bayes. For SVM, liblinear and libshorttext classifiers are good options (and have been used in many industrial applcitions):
liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
libshorttext:https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
They are also included with ML tools such as scikit-learna and WEKA.
With classifiers, it is still some operation to build and validate a pratically useful classifier. One of the challenges is to mix
discrete (boolean and enumerable)
and continuous ('numbers')
predictive variables seamlessly. Some algorithmic preprocessing is generally necessary.
Neural networks do offer the possibility of using both types of variables. However, they require skilled data scientists to yield good results. A straight-forward option is to use an online classifier web service like Insight Classifiers to build and validate a classifier in one go. N-fold cross validation is being used there.
You can represent the presence or absence of each word in a separate column. The outcome variable is desired category.
I am a newbie in NLP, just doing it for the first time.
I am trying to solve a problem.
My problem is I have some documents which are manually tagged like:
doc1 - categoryA, categoryB
doc2 - categoryA, categoryC
doc3 - categoryE, categoryF, categoryG
.
.
.
.
docN - categoryX
Here I have a fixed set of categories and any document can have any number of tags associated with it.
I want to train the classifier using this input, so that this tagging process can be automated.
Thanks
What you are trying to do is called multi-way supervised text categorization (or classification). Knowing the right question to ask is half the problem.
As for how this can be done, here are two references:
RCV1 : A New Benchmark Collection for Text Categorization
Research
Improved Nearest Neighbor Methods For Text Classification With
Language Modeling and Harmonic Functions
Most of classifier works on Bag of word model . There are multiple use case to get expected result.
Try out most general Multinomial naive base classifer with changing different input paramters and check out result.
Try variants of ML Naive base (http://scikit-learn.org/0.11/modules/naive_bayes.html)
You can check out sentence classifier along with considering sentence structures. Considering ngram concepts, you can try out with 2,3,4,5 gram models and check how result varies. Count vectorizer allows ngram, check out this link for example - http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Based on dataset features, not a single classifier can be best for you scenario, you have to check out different use case, which fits best for you.
Most initial approach is, you get started with simple classifier using scikit learn.
Put each category as traning class and train the classifier with this classes
For any input docX, classifier with trained model
You will get probability result for each category
Now put some threshold like probability different between three most highest resulting category, if it matches the threshold consider those category as result for that input class.
its not clear what you have tried or what programming language you are using but as most have suggested try text classification with document vectors, bag of words (as long as there are words in the documents that can help with classification)
Here are some simple tools that can help get you started
Weka http://www.cs.waikato.ac.nz/ml/weka/ (GUI & Java)
NLTK http://www.nltk.org (Python)
Mallet http://mallet.cs.umass.edu/ (command line & Java)
NUML http://numl.net/ (C#)
This is in the context of doing sentiment analysis using LingPipe machine learning tool. I have to classify if a sentence in a big paragraph has a positive/negative sentiment. I know of the following approach in LingPipe
Classify if the complete paragraph based on its polarity - negative or positive.
Here, I yet don't know the polarity at the sentence level. We are still at the paragraph level. How do I determine the polarity at the sentence level of a paragraph, of whether a sentence in a paragraph is a positive/negative sentence? I know that LingPipe is capable of classifying if a sentence is subjective/objective. So using this approach,,,,
,,,, should I
First train LingPipe on a large set of sentences that are subjective/objective.
Use the trained model to extract all subjective sentences out of a test paragraph.
Train a LingPipe classifier based on the extracted subjective sentences for polarity by manually labeling them as positive/negative.
Now used the trained polarity model and feed a test subjective sentence (that is done by passing a sentence through the trained subjective/objective) model, and then determine if the statement is positive/negative?
Does the above approach work? In the above proposed approach, we know that LingPipe is capable of accepting a large textual content (paragraph) for polarity classification. Will it do a good job if we just pass a single subjective sentence for polarity classification? I am confused!
You might want to take a look at the multi-level analysis approaches in the literature, e.g.
Li, S., et al. (2010). "Exploiting Combined Multi-level Model for Document Sentiment Analysis," 2010 International Conference on Pattern Recognition.
Yessenalina, A., et al. (2010). "Multi-level Structured Models for Document-level Sentiment Classification," Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1046–1056,MIT, Massachusetts, USA, 9-11 October 2010.
Multi-level analysis approaches are quite common in information retrieval, as in content indexing for vector space similarity search.
Environments such as Ling Pipe are a good way to get started but eventually you need to employ lower level, finer grained tools such as yura suggested.
Most machine leraning libraries including lingpipe are row based(object with planar features) . So if you want do some hierarchical classification with it you should denormolize you data. for example you can have features of paragrahp and sentence at same feature set. If you use by word only clasification you can create such features PARGRAPH_WORDX=true, SENTENCE_WORDX=true.
Some other toolkits allow you to express you model withot denormalisation, it is so called graphical models exampels are CRF, ACRF, Markov Models etc implementation of those you can find in mallet and Factorie.