How To Train Stanford NER For Names That Include Spaces?

The example training exercise labels single-term names after tokenizing with something like a simple split(' ').
I need to train for and recognize names that include spaces. How do I train the recognizer?
Example: "I saw a Big Red Apple Tree." -- How would I tokenize for training and then recognize "Big Red Apple Tree" instead of recognizing four separate words?
Will this work for the training data?
I\tO
saw\tO
a\tO
Big Red Apple Tree\tMyName
.\tO
Would the output from the recognizer look the same as that?
The training section in the FAQ says "The training file parser isn't very forgiving: You should make sure each line consists of solely content fields and tab characters. Spaces don't work."

The problem you are trying to solve is phrase identification (chunking). There are different ways you can tag the words; for example, you can tag them with IOB tags. Train the Stanford NER model on this newly created data, then write a post-processing step to concatenate the predicted tokens back into phrases.
For example, your training data should look like this:
I\tO
saw\tO
a\tO
Big\tB-MyName
Red\tI-MyName
Apple\tI-MyName
Tree\tI-MyName
.\tO
So basically, you are using [O, B-MyName, I-MyName] as tags.
I have solved a similar problem this way and it works great. But make sure you have enough data to train on.
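The post-processing step can be a small function; here is a minimal sketch (merge_iob is a hypothetical name) that assumes the recognizer's output has been parsed into (token, tag) pairs using the IOB scheme above:
def merge_iob(tagged_tokens):
    # Concatenate consecutive B-/I- tokens back into whole phrases.
    phrases, current = [], []
    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:  # an O tag closes any open phrase
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

print(merge_iob([("I", "O"), ("saw", "O"), ("a", "O"),
                 ("Big", "B-MyName"), ("Red", "I-MyName"),
                 ("Apple", "I-MyName"), ("Tree", "I-MyName"), (".", "O")]))
# ['Big Red Apple Tree']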

Related

NLP / Rails sentiment search

I am building a tool from scratch that takes a sample of text and turns it into a list of categories. I am not using any libraries for this at the moment, but I am interested if anyone has experience in this territory, as the hardest part I'm struggling with is building sentiment into the search. It's easy to match words, but sentiment is much more challenging.
The goal would be to take something like this paragraph;
"Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"
and turn it into
categories = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
If possible I'd like to end up adding a filter for negative sentiment so that if the text said;
"I hate cooking"
'cooking' wouldn't be included in the categories.
Any help is greatly appreciated. TIA 👍
You seem to have at least two tasks: 1. sequence classification by topics; 2. sentiment analysis. [Edit: I only noticed now that you are using Ruby/Rails, but the code below is in Python. Maybe this answer is still useful for some people, and the steps can be applied in any language.]
1. For sequence classification by topics, you can either define categories simply with a list of words, as you said; depending on the use case, this might be the easiest option. Or, if such a list of words would be too time-intensive to create, you can use a pre-trained zero-shot classifier. I would recommend the zero-shot classifier from HuggingFace; see details with code here.
Applied to your use-case, this would look like this:
# pip install transformers # pip install in terminal
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sequence = ["Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"]
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
classifier(sequence, candidate_labels, multi_class=True)  # note: newer transformers versions name this argument multi_label
# output:
{'labels': ['photography', 'spain', 'chocolate', 'travel', 'father', 'cooking'],
'scores': [0.9802802205085754, 0.7929317951202393, 0.7469273805618286, 0.6030028462409973, 0.08006269484758377, 0.005216470453888178]}
The classifier returns scores depending on how certain it is that each candidate_label is represented in your sequence. It doesn't catch everything, but it works quite well and is fast to put into practice.
2. For sentiment analysis you can use HuggingFace's sentiment classification pipeline. In your use-case, this would look like this:
classifier = pipeline("sentiment-analysis")
sequence = ["I hate cooking"]
classifier(sequence)
# Output
[{'label': 'NEGATIVE', 'score': 0.9984041452407837}]
Putting 1. and 2. together:
I would probably (a) first take your entire text and split it into sentences (see here how to do that); then (b) run the sentiment classifier on each sentence and discard those with a high negative sentiment score (see step 2 above); and then (c) run your labeling/topic classification on the remaining sentences (see step 1 above).
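A minimal sketch of (a)-(c) combined, assuming nltk for the sentence splitting; the 0.9 and 0.5 thresholds are illustrative choices, not values from the steps above:
# pip install transformers nltk
from nltk.tokenize import sent_tokenize  # needs nltk.download("punkt") once
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
topic_classifier = pipeline("zero-shot-classification")
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']

text = "I like to take portrait photographs of my son. I hate cooking."

categories = set()
for sentence in sent_tokenize(text):                   # (a) split into sentences
    result = sentiment(sentence)[0]
    if result["label"] == "NEGATIVE" and result["score"] > 0.9:
        continue                                       # (b) discard clearly negative sentences
    topics = topic_classifier(sentence, candidate_labels, multi_class=True)  # (c) label the rest
    categories.update(label for label, score in zip(topics["labels"], topics["scores"])
                      if score > 0.5)

print(categories)  # 'cooking' should be filtered out by step (b)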

Use pos tagging in bag of words

I'm using the bag of words for text classification.
Results aren't good enough; test set accuracy is below 70%.
One of the things I'm considering is using POS tagging to distinguish the function of words. What is the go-to approach for doing this?
I'm thinking of appending the tags to the words; for example, for the word "love", if it's used as a noun, use:
love_noun
and if it's a verb use:
love_verb
Test set accuracy near 70% is not that bad if you have hundreds of categories. You might want to measure overall precision and recall instead of accuracy.
What you proposed sounds good; it is an approach that adds feature conjunctions as additional features. Here are a few suggestions:
Still keep your original features. That is to say, don't replace love with love_noun or love_verb. Instead, you have two features coming from love:
love, love_noun (or)
love, love_verb
If you need some sample code, you can start from the nltk Python package.
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("Love is a lovely thing"))
[('Love', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('lovely', 'JJ'), ('thing', 'NN')]
Consider using n-grams, maybe starting by adding 2-grams. For example, you might have "in" and "stock", and you might just remove "in" because it is a stop word. If you consider 2-grams, you will get a new feature:
in-stock
which has a different meaning from "stock" alone. It might help a lot in certain cases, for example, to distinguish "finance" from "shopping".
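Putting both suggestions together, here is a minimal sketch (extract_features is a hypothetical helper, not part of nltk):
# requires nltk data: punkt and averaged_perceptron_tagger
from nltk import pos_tag, word_tokenize

def extract_features(text):
    tokens = word_tokenize(text)
    features = list(tokens)                                       # original word features
    features += [f"{w}_{t}" for w, t in pos_tag(tokens)]          # word_POS conjunctions
    features += [f"{a}-{b}" for a, b in zip(tokens, tokens[1:])]  # 2-grams
    return features

print(extract_features("Love is a lovely thing"))
# ['Love', 'is', ..., 'Love_NNP', 'is_VBZ', ..., 'Love-is', 'is-a', ...]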

Character n-gram vs word features in NLP

I'm trying to predict whether reviews on Yelp are positive or negative by performing linear regression using SGD. I tried two different feature extractors. The first was character n-grams and the second was separating words by spaces. I tried different n values for the character n-grams and found the n value that gave me the best test error. I noticed that this test error (0.27 on my test data) was nearly identical to the test error from extracting the words separated by spaces. Is there a reason behind this coincidence? Shouldn't the character n-grams have a lower test error, since they extract more features than the word features?
Character n-gram: ex. n=7
"Good restaurant" => "Goodres" "oodrest" "odresta" "drestau" "restaur" "estaura" "stauran" "taurant"
Word features:
"Good restaurant" => "Good" "restaurant"
It looks like the n-gram method simply produced a lot of redundant, overlapping features which do not contribute to precision.
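If you want to inspect the two feature sets yourself, here is a minimal sketch using scikit-learn's CountVectorizer (an assumption; the question does not say which library was used):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Good restaurant", "Bad restaurant"]

char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(7, 7))  # character 7-grams
word_vec = CountVectorizer(analyzer="word")                         # word features

print(char_vec.fit(docs).get_feature_names_out())  # many overlapping 7-grams
print(word_vec.fit(docs).get_feature_names_out())  # ['bad', 'good', 'restaurant']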

which machine learning technique should be used for message classification

I have a dataset of customer messages and their final categories; one example follows:
key  message                                                    final category
1    i want customer care no i want to talk with ur team        other
2    hi I 9986443603cjhh had qkuiv1uhqllljqvocally q illgi vq   noclass
3    hai points not coming                                      checking
The dataset is a huge file with at least 20 final category types. Please suggest an appropriate method for classifying a message into its final category. I am thinking of building a feature vector from the message words and feeding it into a naive Bayes classifier; would that work well? Or should I use another technique?
Thanks a lot.
You can consider word embeddings.
You can download the embeddings from here (the link is for GloVe; you can alternatively use word2vec).
The idea is that similar words will have similar vectors.
After you convert each word in your message to a vector, you can average all the vectors (or average them with TF-IDF weights for better results) to get a vector representation of your message.
Of course, words like qkuiv1uhqllljqvocally will not appear in the vocabulary.
To check your results, you can cluster all the message vectors (using k-means with k=20 if you have 20 classes) to see whether similar messages end up in the same group.
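As a minimal sketch of the averaging step (assuming a file such as glove.6B.50d.txt has already been downloaded from the GloVe site):
import numpy as np

def load_glove(path):
    # GloVe files are plain text: one word followed by its vector per line.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def message_vector(message, vectors, dim=50):
    # Words missing from the vocabulary (like qkuiv1uhqllljqvocally) are skipped.
    known = [vectors[w] for w in message.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

glove = load_glove("glove.6B.50d.txt")
vec = message_vector("i want customer care no i want to talk with ur team", glove)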

Stanford GloVe's lack of punctuation?

I understand that GloVe trains vectors by noticing what frequently co-occurs, etc., but how come commas and periods are not included? For anything NLP, it seems like having a vector representation for punctuation would be important. I realize that something like (king - man = queen) would make no sense with (word - , = ?), but is there a way to represent punctuation marks and numbers?
Is there a pre-made data set that includes such things? Would this even work?
I tried training GloVe with my own data set, but I ran into a problem with separating the punctuation (with a blank space) between words, etc.
Pre-trained GloVe vectors do have punctuation; what makes you think they don't? At least the Wikipedia 2014 + Gigaword 5 (6B tokens) set from http://nlp.stanford.edu/projects/glove/ has embeddings for ",", ".", "-", and others. Just download these word vectors and verify it yourself; they are in plain text format, so it's easy to do.
I have worked a bit with the word vectors used by Senna, and I am looking at the vocab list.
http://ml.nec-labs.com/senna/
I definitely see entries for punctuation.
A trick for handling numbers is to replace every digit with 0, and then learn a distribution for each pattern. For instance 1999 is mapped to 0000, 01-01-2015 is mapped to 00-00-0000, etc...
Senna has entries for these patterns like 0000, etc...
I will look over GloVe and try to update this answer soon...
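The digit-replacement trick described above is easy to implement; a minimal sketch:
import re

def normalize_digits(token):
    # 1999 -> 0000, 01-01-2015 -> 00-00-0000
    return re.sub(r"\d", "0", token)

print(normalize_digits("01-01-2015"))  # 00-00-0000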
It is totally OK, and also common, to handle punctuation marks as single tokens for word vector generation; see, for example, the word2vec papers. I assume that the prebuilt word2vec datasets have punctuation, and I'm sure the prebuilt GloVe vectors have punctuation as well.
There are a lot of tokenizers that separate punctuation marks into separate tokens. One I know for sure is the ARK Tweet Tokenizer.
I have used the following conversion for numbers and punctuation. It is not an ideal approach, but it can be slightly useful.
For numbers, I convert every number to "NUM".
ex: 178 = "NUM" or 654 = "NUM"
For punctuation, I convert every mark to "PUNC".
ex: apple, orange, banana = apple "PUNC" orange "PUNC" banana
This is not a good solution, but it works in some cases.
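A minimal sketch of that conversion, assuming punctuation has already been split off into separate tokens:
import string

def normalize_token(token):
    if token.isdigit():
        return "NUM"                              # 178 -> NUM
    if all(c in string.punctuation for c in token):
        return "PUNC"                             # , -> PUNC
    return token

print([normalize_token(t) for t in "apple , orange , banana".split()])
# ['apple', 'PUNC', 'orange', 'PUNC', 'banana']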
