NLTK for Named Entity Recognition - machine-learning

I am trying to use the NLTK toolkit to extract place, date and time from text messages. I just installed the toolkit on my machine and wrote this quick snippet to test it out:
sentence = "Let's meet tomorrow at 9 pm";
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print nltk.ne_chunk(pos_tags, binary=True)
I assumed it would identify the date (tomorrow) and the time (9 pm), but surprisingly it failed to recognize them. I get the following result when I run the code above:
(S (GPE Let/NNP) 's/POS meet/NN tomorrow/NN at/IN 9/CD pm/NN)
Can someone help me understand whether I am missing something, or whether NLTK is just not mature enough to tag times and dates properly? Thanks!

The default NE chunker in nltk is a maximum entropy chunker trained on the ACE corpus (http://catalog.ldc.upenn.edu/LDC2005T09). It has not been trained to recognise dates and times, so you need to train your own classifier if you want to do that.
Have a look at http://mattshomepage.com/articles/2016/May/23/nltk_nec/, where the whole process is explained very well.
Also, there is a module called timex in nltk_contrib which might help you with your needs. https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
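For illustration, here is a minimal sketch of using that module, assuming its tag() helper (which, per the linked source, wraps temporal expressions in TIMEX tags); treat the exact output as an assumption:
# Sketch using nltk_contrib's timex module; tag() wraps temporal
# expressions it recognises in <TIMEX2> tags (per the linked source).
from nltk_contrib import timex

print(timex.tag("Let's meet tomorrow at 9 pm"))
# Expected (assumption): "Let's meet <TIMEX2>tomorrow</TIMEX2> at 9 pm"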

Named entity recognition is not an easy problem; do not expect any library to be 100% accurate, and you shouldn't draw conclusions about NLTK's performance from one sentence. Here's another example:
sentence = "I went to New York to meet John Smith";
I get
(S
  I/PRP
  went/VBD
  to/TO
  (NE New/NNP York/NNP)
  to/TO
  meet/VB
  (NE John/NNP Smith/NNP))
As you can see, NLTK does very well here. However, I couldn't get NLTK to recognise today or tomorrow as temporal expressions. You can try Stanford SUTime, which is part of Stanford CoreNLP; I have used it before and it works quite well (it is in Java, though).
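If you want to stay in Python, there is a third-party sutime wrapper around SUTime; here is a rough sketch, assuming its SUTime class and parse() method (the CoreNLP jar setup it needs is omitted):
# Hypothetical sketch using the third-party `sutime` Python wrapper
# (pip install sutime); the constructor arguments and the structure
# of the parsed output are assumptions based on the wrapper's docs.
from sutime import SUTime

su = SUTime(mark_time_ranges=True)
print(su.parse("Let's meet tomorrow at 9 pm"))
# Expected (assumption): a list of dicts with the matched 'text',
# its 'type' (e.g. DATE or TIME) and a resolved value.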

If you wish to correctly identify the date or time in text messages, you can use Stanford's NER.
It uses a CRF (Conditional Random Field) classifier. CRF is a sequential classifier, so it takes the sequence of words into consideration: how a sentence is phrased affects how its tokens are classified.
If your input sentence had been Let's meet on wednesday at 9am., then Stanford NER would have correctly identified wednesday as a date and 9am as a time.
NLTK provides a wrapper for Stanford NER. Try using it.
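A rough sketch of calling Stanford NER through NLTK's wrapper; the two paths below are placeholders for your local Stanford NER download, and the 7-class MUC model is the one that includes DATE and TIME:
# Sketch of NLTK's Stanford NER wrapper; the model and jar paths are
# placeholders pointing at a local Stanford NER installation.
from nltk.tag import StanfordNERTagger
from nltk import word_tokenize

st = StanfordNERTagger(
    "/path/to/english.muc.7class.distsim.crf.ser.gz",  # 7-class model (has DATE/TIME)
    "/path/to/stanford-ner.jar",
)
print(st.tag(word_tokenize("Let's meet on wednesday at 9am.")))
# Expected (assumption): ('wednesday', 'DATE') and ('9am', 'TIME')
# among tokens tagged 'O'.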

Related

NLP / Rails sentiment search

I am building a tool from scratch that takes a sample of text and turns it into a list of categories. I am not using any libraries for this at the moment, but I am interested if anyone has experience in this territory, as the hardest part I'm struggling with is building sentiment into the search. It's easy to match words, but sentiment is much more challenging.
The goal would be to take something like this paragraph;
"Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"
and turn it into
categories = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
If possible I'd like to end up adding a filter for negative sentiment, so that if the text said:
"I hate cooking"
'cooking' wouldn't be included in the categories.
Any help is greatly appreciated. TIA 👍
You seem to have at least two tasks: 1. Sequence classification by topics; 2. Sentiment analysis. [Edit, I only noticed now that you are using Ruby/Rails, but the code below is in Python. But maybe this answer is still useful for some people and the steps can be applied in any language.]
1. For sequence classification by topics, you can either define categories with a simple list of words, as you said, or use a pre-trained classifier. Depending on the use-case, the word list might be the easiest option; if it would be too time-intensive to create, I would recommend the zero-shot classifier from HuggingFace, see details with code here.
Applied to your use-case, this would look like this:
# pip install transformers  # run the install in a terminal first
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
sequence = ["Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"]
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
classifier(sequence, candidate_labels, multi_class=True)
# output:
{'labels': ['photography', 'spain', 'chocolate', 'travel', 'father', 'cooking'],
'scores': [0.9802802205085754, 0.7929317951202393, 0.7469273805618286, 0.6030028462409973, 0.08006269484758377, 0.005216470453888178]}
The classifier returns a score for how certain it is that each candidate_label is represented in your sequence. It doesn't catch everything, but it works quite well and is fast to put into practice.
2. For sentiment analysis you can use HuggingFace's sentiment classification pipeline. In your use-case, this would look like this:
classifier = pipeline("sentiment-analysis")
sequence = ["I hate cooking"]
classifier(sequence)
# Output
[{'label': 'NEGATIVE', 'score': 0.9984041452407837}]
Putting 1. and 2. together:
I would probably (a) first take your entire text and split it into sentences (see here how to do that); then (b) run the sentiment classifier on each sentence and discard those that have a high negative sentiment score (see step 2. above); and then (c) run your labeling/topic classification on the remaining sentences (see 1. above).
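A minimal sketch of (a)-(c) put together, assuming NLTK's sent_tokenize for the sentence-splitting step and an arbitrary 0.9 threshold for "high negative sentiment":
# Sketch combining sentence splitting, sentiment filtering and
# zero-shot topic labeling; the 0.9 cutoff is an arbitrary choice.
from nltk import sent_tokenize
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
topics = pipeline("zero-shot-classification")
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']

text = "I like to take portrait photographs of my son. I hate cooking."
kept = []
for sent in sent_tokenize(text):        # (a) split into sentences
    result = sentiment(sent)[0]         # (b) sentiment per sentence
    if result['label'] == 'NEGATIVE' and result['score'] > 0.9:
        continue                        # discard strongly negative sentences
    kept.append(sent)

for sent in kept:                       # (c) label the remaining sentences
    print(topics(sent, candidate_labels, multi_class=True))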

Use pos tagging in bag of words

I'm using the bag of words for text classification.
Results aren't good enough; test set accuracy is below 70%.
One of the things I'm considering is using POS tagging to distinguish the function of words. What is the go-to approach for doing it?
I'm thinking of appending the tags to the words. For example, for the word "love": if it's used as a noun, use:
love_noun
and if it's a verb use:
love_verb
Test set accuracy near 70% is not that bad if you have hundreds of categories. You might want to measure overall precision and recall instead of accuracy.
What you propose sounds good; it is essentially adding feature conjunctions as additional features. Here are a few suggestions:
Still keep your original features. That is to say, don't replace love with love_noun or love_verb. Instead, you have two features coming from love:
love, love_noun (or)
love, love_verb
If you need some sample code, you can start from the nltk Python package.
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("Love is a lovely thing"))
[('Love', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('lovely', 'JJ'), ('thing', 'NN')]
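Building on that, here is a small sketch (my own, not from the answer above) that emits both the plain word and the word_TAG conjunction as features:
# Sketch of the feature-conjunction idea: each token contributes
# the plain word plus a word_TAG variant.
from nltk import pos_tag, word_tokenize

def conjoined_features(text):
    features = []
    for token, tag in pos_tag(word_tokenize(text)):
        features.append(token.lower())             # original feature
        features.append(token.lower() + "_" + tag) # conjoined feature
    return features

print(conjoined_features("Love is a lovely thing"))
# ['love', 'love_NNP', 'is', 'is_VBZ', 'a', 'a_DT', ...]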
Consider using n-grams, maybe starting by adding 2-grams. For example, you might have "in" and "stock", and you might just remove "in" because it is a stop-word. If you consider 2-grams, you will get a new feature:
in-stock
which has a different meaning from "stock" alone. It might help a lot in certain cases, for example to distinguish "finance" from "shopping".
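One way to add 2-grams is scikit-learn's CountVectorizer (using sklearn here is my choice of library, not the answer's):
# Sketch: unigrams plus 2-grams with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))  # keep 1-grams and add 2-grams
X = vec.fit_transform(["the item is in stock", "the stock price fell"])
print(vec.get_feature_names_out())
# includes 'in stock' and 'stock price' as separate features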

which machine learning technique should be used for message classification

I have a dataset of customer messages and their final categories. One example is the following:
key | message                                                  | final category
1   | i want customer care no i want to talk with ur team      | other
2   | hi I 9986443603cjhh had qkuiv1uhqllljqvocally q illgi vq | noclass
3   | hai points not coming                                    | checking
The dataset is a huge file with at least 20 final-category types. Please suggest an appropriate method to classify a message into its final category. I am thinking of building a feature vector from the message words and feeding it into a naive Bayes classifier; would that work well, or do I have to use another technique?
Thanks a lot.
You can consider word embeddings.
You can download the embeddings from here (the link is for GloVe; you can alternatively use word2vec).
The idea is that similar words will have similar vectors.
After you convert each word in your message to a vector, you can average all the vectors (or average them weighted by TF-IDF for better results) to get a vector representation of your message.
Of course, words like qkuiv1uhqllljqvocally will not appear in the vocabulary.
To check your results, you can cluster all your vectors (using K-means with 20 clusters, if you have 20 classes) to see that similar messages fall into the same group.
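A minimal sketch of the averaging step, assuming you have downloaded a GloVe text file (the file name is an example, and the simple whitespace tokenization is an assumption):
# Sketch: average GloVe vectors to represent a message.
import numpy as np

embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.array(parts[1:], dtype=np.float32)

def message_vector(message, dim=100):
    # Out-of-vocabulary tokens like "qkuiv1uhqllljqvocally" are skipped.
    vectors = [embeddings[w] for w in message.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

print(message_vector("hai points not coming").shape)  # (100,)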

Best data analytic techniques/models for personal project

I'm not really sure how to word this and I'm sorry if the formatting is wrong, but I'm trying to get a foundation to be able to tackle this problem myself.
I am trying to develop a prediction algorithm for a set of data of "Hip Surgery Patients" that looks like:
Readmission Time | Symptom Code | Symptom Note | Related
6 | 2334 | swelling in hip | Yes
12 | 1324 | anxiety | Maybe
8 | 2334 | swelling in hip | Yes
30 | 1111 | Headaches | No
3 | 7934 | easily bruising | Yes
For context, doctors can identify whether or not a given "Symptom Code" is related to the "Hip Replacement Surgery" that occurred X days ago. I have about 200 entries in my data set that match this format, and my goal is to be able to match results in the given set as well as predict new results in the "Related" Column (with certainty statistics on predicted results) based on new inputs. For example given:
Input: 20 | 2334 | swelling in hip
Output: Yes (90% confidence)
I'm very new to data analytics and machine learning, so I would really just like some pointers on things to look up or where to start my research. I imagine there's an optimal function/model that would handle this best, but as I said I'm very new to the topic, so I have no clue where to start. Since I have a relatively small dataset, I'm looking for a technique that isn't easily overfitted, if possible.
I really appreciate any help and pointers on where to get started.
Based on your data snippet, it looks like a multiclass classification problem (the three classes being Yes, Maybe, or No).
Your columns (aside from Related) will be your features, which can be reduced to numeric representations. For instance:
For the Symptom Note Feature, you can have a mapping as seen below:
Swelling in hip = 1
Anxiety = 2
Swelling = 3
Easily Bruised = 4
Obviously this can only work if you have a definite number of symptoms in this column. Machine learning algorithms usually work with numbers, so your features will be extracted from the raw data into numeric form. Once that has been done, you can feed the data into a classification algorithm. The naive Bayes algorithm is a great place to start, as sketched below.
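A toy sketch of that starting point, using scikit-learn (my choice of library) and the five rows from the question; note that integer-encoding the symptom note is only workable because the set of symptoms is small and fixed:
# Toy sketch: integer-encode the symptom note and fit a naive Bayes
# classifier; predict_proba gives the certainty estimate asked about.
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

readmission = [6, 12, 8, 30, 3]
symptoms = ["swelling in hip", "anxiety", "swelling in hip",
            "headaches", "easily bruising"]
related = ["Yes", "Maybe", "Yes", "No", "Yes"]

enc = LabelEncoder().fit(symptoms)
X = list(zip(readmission, enc.transform(symptoms)))

clf = GaussianNB().fit(X, related)
query = [[20, enc.transform(["swelling in hip"])[0]]]
print(clf.predict(query), clf.predict_proba(query))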
Scikit-learn (if you can work with Python) has a great introductory example on a 3-class classification task where all the features are numbers. It tries to classify different types of iris flowers based on sepal length, sepal width, petal length and petal width.
The full tutorial can be found here: Supervised learning: predicting an output variable from high-dimensional observations
Is it feasible to get additional data? If it is, I suggest you get more: 200 instances is quite small and may not properly represent the feature space. In addition, it is useful to split the data into a training and a test set, further reducing the quantity used for training. You can also opt for K-fold cross-validation.
In summary: navigate to that scikit-learn page and try out the flower classification example. Once you're familiar with the environment, your data will need some cleaning and feature extraction. You will need to answer questions like: what is the meaning of Readmission Time and Symptom Code? Do those values span a specified range with a special internal meaning, or are they just random numbers assigned like an ID?
I would recommend transcribing your data into ARFF format and then using it with Weka. Weka is a program with many machine learning algorithms you can experiment with; it also has a very simple user interface, so it is good for beginners! Once you have found an algorithm that works well, you can save your trained model and use it to predict new instances.
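For reference, a hypothetical ARFF rendering of the rows from the question might look like this (the relation and attribute names are my own):
% Hypothetical ARFF file for the patient data (attribute names assumed)
@relation hip_patients

@attribute readmission_time numeric
@attribute symptom_code numeric
@attribute symptom_note string
@attribute related {Yes,Maybe,No}

@data
6,2334,'swelling in hip',Yes
12,1324,'anxiety',Maybe
8,2334,'swelling in hip',Yes
30,1111,'Headaches',No
3,7934,'easily bruising',Yes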

What is wrong with 20newsgroup-18828 dataset?

I am currently using the 20NewsGroup-18828 dataset in Weka. I have selected a subset of documents with 100 per category (2000 documents in total), which I divided into a 70% (training) / 30% (testing) split. When I tried classification with naive Bayes, SVM and k-NN, the accuracy was very low. Here is the list of operations I am performing on the dataset:
StringToWordVector (indexing and term weighting with TF-IDF, Smart stopword list, Snowball stemmer)
Dimensionality reduction with feature selection (InformationGain)
Dimensionality reduction with feature transformation (Random Projection)
When I use the original dataset with 20,000 docs it performs well, but it has duplications, i.e. some documents appear in multiple categories.
Has anyone used this dataset, or can someone tell me what I am doing wrong?
Regarding differences between datasets
The main difference between 20newsgroup (o, the original dataset) and 20newsgroup-18828 (m, the modified one) is:
o contains duplicates, m does not
o contains a trivial shortcut, as it includes the newsgroup identification header; m includes only the From and Subject headers (so it is still an easy version of the problem, but harder than o). For example:
FILE 51126 regarding atheism
in original form:
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.centerline.com!uunet!olivea!sgigate!sgiblab!adagio.panasonic.com!nntp-server.caltech.edu!keith
From: keith#cco.caltech.edu (Keith Allan Schneider)
Newsgroups: alt.atheism
Subject: Re: >>>>>>Pompous ass
Message-ID: <1pi9btINNqa5#gap.caltech.edu>
Date: 2 Apr 93 20:57:33 GMT
References: <1ou4koINNe67#gap.caltech.edu> <1p72bkINNjt7#gap.caltech.edu> <93089.050046MVS104#psuvm.psu.edu> <1pa6ntINNs5d#gap.caltech.edu> <1993Mar30.210423.1302#bmerh85.bnr.ca> <1pcnqjINNpon#gap.caltech.edu>
Organization: California Institute of Technology, Pasadena
Lines: 9
NNTP-Posting-Host: punisher.caltech.edu
kmr4#po.CWRU.edu (Keith M. Ryan) writes:
>>Then why do people keep asking the same questions over and over?
>Because you rarely ever answer them.
Nope, I've answered each question posed, and most were answered
multiple times.
keith
In modified form (-18828 version):
From: keith#cco.caltech.edu (Keith Allan Schneider)
Subject: Re: >>>>>>Pompous ass
kmr4#po.CWRU.edu (Keith M. Ryan) writes:
>>Then why do people keep asking the same questions over and over?
>Because you rarely ever answer them.
Nope, I've answered each question posed, and most were answered
multiple times.
keith
As you can see, the original data is so easy that you can literally find the name of the label inside the file... this is why you will always get good scores on such data, even if your whole processing concept is very, very wrong.
So the question is not "what is wrong with 20newsgroup-18828" but rather "what is wrong with the original dataset".
General ideas
First, why would you assume that anything is wrong? You are applying fairly arbitrary data-representation processing (two different dimensionality reduction steps) to a very small dataset (70 training vectors per class). There is nothing wrong with this data: it is simple NLP data, and like most NLP tasks it requires large amounts of data, while "naive" (not NLP-specific) dimensionality reduction techniques come with no guarantee of actually helping.
Second, even if you do something wrong, in 90% of cases (an arbitrarily high number) the error lies between what the user thinks they are doing and what they are actually doing. So describing what you do won't lead to any help; you have to show exactly what you do (by giving a reproducible example).