Simple NER - IndexError: string index out of range error

Here is a simple example of Named Entity Recognition (NER) using the Natural Language Toolkit (nltk) library in Python:
import nltk

# Input text
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Perform named entity recognition
entities = nltk.ne_chunk(tokens)

# Print the named entities
print(entities)
When I run this code in my Jupyter Notebook, I get this error.
"IndexError: string index out of range"
Am I missing any installation? Please advise.
Expected output:
(PERSON Barack/NNP Obama/NNP)
(GPE Hawaii/NNP)
(ORGANIZATION United/NNP States/NNPS)

nltk.ne_chunk expects its input to be tagged tokens rather than plain tokens, so I would recommend adding a tagging step between tokenization and NE chunking via nltk.pos_tag. NE chunking will still give you every token, with any detected entities grouped into subtrees. Since you want only the entities, you can check whether a particular chunk is a tree, like the following:
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
tokens = nltk.word_tokenize(text)
tagged_tokens = nltk.pos_tag(tokens)
entities = [chunk for chunk in nltk.ne_chunk(tagged_tokens) if isinstance(chunk, nltk.Tree)]
for entity in entities:
    print(entity)
Please note that this code doesn't give exactly the output you want. Instead it gives:
(PERSON Barack/NNP)
(PERSON Obama/NNP)
(GPE Hawaii/NNP)
(GPE United/NNP States/NNPS)
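If you would rather have consecutive chunks of the same label merged into a single entity (closer to the expected output above), one option is to post-process the chunks yourself. Below is a minimal sketch of that idea; merge_adjacent is a hypothetical helper, not part of nltk's API:
import nltk

text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
tagged = nltk.pos_tag(nltk.word_tokenize(text))
chunks = [c for c in nltk.ne_chunk(tagged) if isinstance(c, nltk.Tree)]

def merge_adjacent(chunks):
    # Merge consecutive subtrees that share the same entity label,
    # e.g. (PERSON Barack) + (PERSON Obama) -> (PERSON Barack Obama).
    merged = []
    for chunk in chunks:
        if merged and merged[-1].label() == chunk.label():
            merged[-1] = nltk.Tree(chunk.label(), list(merged[-1]) + list(chunk))
        else:
            merged.append(chunk)
    return merged

for entity in merge_adjacent(chunks):
    print(entity)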


What's your approach to extracting a summarized paragraph from multiple articles using GPT-3?

In the following scenario, what's your best approach using GPT-3 API?
You need to come up with a short paragraph about a specific subject
You must base your paragraph on a set of 3-6 articles, written with no known structure
Here is what I found to work well:
The main constraint is the OpenAI token limit on the prompt
Because of that constraint, I ask GPT-3 to parse the unstructured data for the specific subjects in each prompt request.
I then iterate over each article and save all of the extracted output into one string variable.
Then I repeat the request one last time, but using the new string variable as the input.
If an article is too long, I cut it into smaller chunks.
Of course, fine-tuning the model on the specific subject beforehand will produce much better results.
The temperature should be set to 0, to make sure GPT-3 uses only facts from the data source.
Example:
Let's say I want to write a paragraph about Subject A, Subject B, and Subject C, and I have 5 articles as references.
The OpenAI Playground will look something like this:
Example Article 1
----
Subject A: example A for GPT-3
Subject B: n/a
Subject C: n/a
=========
Example Article 2
----
Subject A: n/a
Subject B: example B for GPT-3
Subject C: n/a
=========
Example Article 3
----
Subject A: n/a
Subject B: n/a
Subject C: example C for GPT-3
=========
Article 1
-----
Subject A:
Subject B:
Subject C:
=========
... repeating with all articles, save to str
=========
str
-----
Subject A:
Subject B:
Subject C:
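A minimal Python sketch of this two-pass loop, using the legacy openai Completion endpoint; the engine name, prompt wording, and the extract_subjects helper are illustrative assumptions, not part of the original workflow:
import openai  # legacy 0.x SDK assumed

SUBJECTS = ["Subject A", "Subject B", "Subject C"]

def extract_subjects(article_text):
    # Hypothetical helper: ask GPT-3 to pull the subjects out of one article.
    prompt = (
        "Extract what the following article says about "
        + ", ".join(SUBJECTS)
        + ". Write 'n/a' if a subject is not covered.\n----\n"
        + article_text
    )
    response = openai.Completion.create(
        engine="text-davinci-003",  # assumption: any GPT-3 completion engine works
        prompt=prompt,
        temperature=0,              # stick to facts from the data source
        max_tokens=256,
    )
    return response.choices[0].text.strip()

articles = ["...article 1 text...", "...article 2 text..."]  # 3-6 articles in practice

# Pass 1: per-article extraction, accumulated into one string variable.
combined = "\n=========\n".join(extract_subjects(a) for a in articles)

# Pass 2: one last request over the combined extractions.
final = openai.Completion.create(
    engine="text-davinci-003",
    prompt="Write a short paragraph about " + ", ".join(SUBJECTS) + " based on:\n" + combined,
    temperature=0,
    max_tokens=256,
)
print(final.choices[0].text.strip())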
One may use the Python library GPT Index (MIT license) to summarize a collection of documents. From the documentation:
index = GPTTreeIndex(documents)
response = index.query("<summarization_query>", mode="summarize")
The “default” mode for a tree-based query is traversing from the top of the graph down to leaf nodes. For summarization purposes we will want to use mode="summarize".
 A summarization query could look like one of the following:
“What is a summary of this collection of text?”
“Give me a summary of person X’s experience with the company.”
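For completeness, a minimal end-to-end sketch, assuming the articles are plain-text files in a local data/ folder (the folder name is an assumption; SimpleDirectoryReader was the loader shipped with early gpt_index releases):
from gpt_index import GPTTreeIndex, SimpleDirectoryReader

# Load the source articles from a local folder.
documents = SimpleDirectoryReader("data").load_data()

index = GPTTreeIndex(documents)
response = index.query("What is a summary of this collection of text?", mode="summarize")
print(response)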

How to parse files with arbitrary lengths?

I have a text file that I'd like to parse with records like this:
===================
name: John Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Sun Java Certified Programmer
Age: 29
===================
name: Bob Bear
Education: High School Diploma
Age: 18
===================
name: Jane Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Master's Degree
Education: AWS Certified Solution Architect Professional
Age: 25
As you can see, the fields in such a text file are fixed, but some of them repeat an arbitrary number of times. The records are separated by a fixed-length delimiter line of = characters.
How would I write parsing logic for this sort of problem? I am thinking of using a switch on the start of each line, but the logic to handle the repeating fields baffles me.
A good way to approach this sort of problem is to "divide and conquer": divide the overall problem into smaller sub-problems which are easier to manage, then solve each of them individually. If you've planned properly, then once you've finished each of the sub-problems you should have solved the whole problem.
Start by thinking about modeling. The document appears to contain a list of records; what should those records be called? What named fields should the records contain, and what types should they have? How would you represent them idiomatically in Go? For example, you might decide to call each record a Person with fields like these:
type Person struct {
    Name        string
    Credentials []string
    Age         int
}
Next, think about what the interface (signature) of your parse function should look like. Should it emit an array of people? Should it use a visitor pattern and emit a person as soon as it's parsed? What constraints should drive the answer? Are memory or compute time constraints important? Does the user of the parser want any control over the parsing work such as canceling? Do they need metadata such as the total number of records contained in the document? Will the input always be from a file or a string, maybe from an HTTP request or a network socket? How will these choices drive your design?
func ParsePeople(string) ([]Person, error)           // ?
func ParsePeople(io.Reader) ([]Person, error)        // ?
func ParsePeople(io.Reader, func(Person) bool) error // ?
Finally you can implement your parser to fulfill the interface that you've decided on. A straightforward approach here would be to read the input file line-by-line and take an action according to the contents of the line. For example (in pseudocode):
forEach line = inputFile.line
    if line is a separator
        emit or store the last parsed person, if present
        create a new person to store parsed fields
    else if line is a data field
        parse the data
        update the person with the parsed data
    end
end
return the parsed records or final record, if emitting
Each line of pseudocode above represents a sub-problem that should be easier to solve than the whole.
Edit: added an explanation of why I just post a program as an answer.
I am presenting a very straightforward implementation to parse the text you have given in your question. You accepted maerics' answer and that is OK. I want to add some counter-arguments to that answer, though. Basically, the pseudocode in that answer is a non-compilable version of the code in my answer, so we agree on the solution itself.
What I do not agree with is the over-engineering talk. I have to deal with code written by over-thinkers every day. I urge you NOT to think about patterns, memory and time constraints, or who might want what from this in the future.
Visitor pattern? That is something that is pretty much only useful when parsing programming languages; do not try to construct a use case for it out of this problem. The visitor pattern is for traversing trees that contain different types of things. Here we have a list, not a tree, of things that are all the same.
Memory and time constraints? Are you parsing 5 GB of text with this? Then this might be a real concern. But even if you are, always write the simplest thing first; it will usually suffice. Throughout my career I have only needed something other than a simple array, or a complicated algorithm, at most once per year. Still, I see code everywhere that uses complicated data structures and algorithms without reason. This complicates change, is error-prone, and sometimes even makes things slower. Do not use an observable list abstraction that notifies all observers whenever its contents change - but wait, let's add an update lock and unlock so we can control when NOT to notify everybody... No! Do not go down that route. Use a slice. Do your logic. Make everything read easily from top to bottom. I do not want to jump from A to B to C, chasing interfaces, following getters, only to finally find not a concrete data type but yet another interface. That is not the way to go.
These are the reasons why my code does not export anything; it is a self-contained, runnable example, a concrete solution to your concrete problem. You can read it, and it is easy to follow. It is not heavily commented because it does not need to be: the three comments state not what happens but why it happens. Everything else is evident from the code itself. I left the note about the potential error in there on purpose. You know what kind of data you have; there is no line in there where this bug would be triggered. Do not write code to handle what cannot happen. If in the future someone adds a line without text after the colon (remember, nobody will ever do this, so do not worry about it), this will trigger a panic, point you to this line, you add another if or something, and you are done. This code is more future-proof than a program that tries to handle all kinds of non-existent variations of the input.
The main point that I want to stress is: write only what is necessary to solve the problem at hand. Everything beyond that makes your program harder to read and change, and it will be untested and unnecessary.
With that said, here is my original answer:
https://play.golang.org/p/T6c51jSM5nr
package main

import (
    "fmt"
    "strconv"
    "strings"
)

func main() {
    type item struct {
        name       string
        educations []string
        age        int
    }

    var items []item
    var current item
    finishItem := func() {
        if current.name != "" { // handle the first ever separator
            items = append(items, current)
        }
        current = item{}
    }

    lines := strings.Split(code, "\n")
    for _, line := range lines {
        if line == separator {
            finishItem()
        } else {
            colon := strings.Index(line, ":")
            if colon != -1 {
                id := line[:colon]
                value := line[colon+2:] // note potential bug if text has nothing after ':'
                switch id {
                case "name":
                    current.name = value
                case "Education":
                    current.educations = append(current.educations, value)
                case "Age":
                    age, err := strconv.Atoi(value)
                    if err == nil {
                        current.age = age
                    }
                }
            }
        }
    }
    finishItem() // in case there was no separator at the end

    for _, item := range items {
        fmt.Printf("%s, %d years old, has educations:\n", item.name, item.age)
        for _, e := range item.educations {
            fmt.Printf("\t%s\n", e)
        }
    }
}
const separator = "==================="
const code = `===================
name: John Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Sun Java Certified Programmer
Age: 29
===================
name: Bob Bear
Education: High School Diploma
Age: 18
===================
name: Jane Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Master's Degree
Education: AWS Certified Solution Architect Professional
Age: 25`

HL7 Encoding/Separator Characters

With regard to HL7 pipe-delimited data, how exactly do the encoding characters (|^~\&) work?
Is the following example of fields, field repetitions, components and their sub-components correct when parsing raw HL7 data?
PID|1||||||||||||1234567890^somedata&moredata^TESTEMAIL#GMAIL.COM~0987654321
Field (|):
PID13 = 1234567890^somedata&moredata^TESTEMAIL#GMAIL.COM~0987654321
Field repetition (~):
PID13~1 = 1234567890^somedata&moredata^TESTEMAIL#GMAIL.COM
PID13~2 = 0987654321
Component (^):
PID13.1 = 1234567890
PID13.2 = somedata&moredata
PID13.3 = TESTEMAIL#GMAIL.COM
Sub-component (&):
PID13.2.1 = somedata
PID13.2.2 = moredata
PID13.3.1 = TESTEMAIL#GMAIL.COM
PID13.3.2 =
Without understanding the left-hand side structure you're trying to assign stuff to, it's impossible to tell you if you're doing it right.
There is, however, one right way to parse the segment/field in question.
Here's a link to the specs I reference here
From section 2.5.3 of the HL7v2.7 Standard:
Each field is assigned a data type that defines the value domain of the field – the possible values that it may take.
If you pull up section 3.4.2.13 (PID-13) you'll see a breakdown of each component and subcomponent. Technically, the meaning of subcomponents and components can vary by field, but mostly they just vary by data type.
In your example, you don't treat the repetitions as separate instances of the XTN data type. I would rewrite it using array syntax, like so:
Field repetition (~):
PID13[0] = 1234567890^somedata&moredata^TESTEMAIL#GMAIL.COM
PID13[1] = 0987654321
Component (^):
PID13[0].1 = 1234567890
PID13[0].2 = somedata&moredata
PID13[0].3 = TESTEMAIL#GMAIL.COM
Sub-component (&):
PID13[0].2.1 = somedata
PID13[0].2.2 = moredata
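To illustrate that split order (field, then repetition, then component, then sub-component), here is a minimal Python sketch; it assumes the default encoding characters and ignores escape sequences, which a real HL7 parser must handle:
segment = "PID|1||||||||||||1234567890^somedata&moredata^TESTEMAIL#GMAIL.COM~0987654321"

fields = segment.split("|")  # field separator; fields[0] is the segment name
pid13 = fields[-1]           # PID-13 is the last populated field in this example

for r, repetition in enumerate(pid13.split("~")):                    # repetition separator
    for c, component in enumerate(repetition.split("^"), start=1):   # component separator
        subcomponents = component.split("&")                         # sub-component separator
        if len(subcomponents) > 1:
            for s, sub in enumerate(subcomponents, start=1):
                print(f"PID13[{r}].{c}.{s} = {sub}")
        else:
            print(f"PID13[{r}].{c} = {component}")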
The pseudocode in section 2.6.1 of the same specification may be helpful as well:
foreach occurrence in ( occurrences_of( field ) ) {
    construct_occurrence( occurrence );
    if not last ( populated occurrence ) insert repetition_separator;
    /* e.g., ~ */
}
It's important to remember that those subcomponents have different meanings because PID-13 is an XTN type.
PID-13 is a problematic example because historically, the order of PID-13 mattered. The first repetition was "primary". Over time the field has also become the landing place for e-mail addresses, pager numbers, etc. So good luck trying to make sense out of real-world data.

How do you link back topics generated by LDA model to actual document

The LDA code generates topics, say from 0 to 5. Is there a standard way (a norm) to link the generated topics back to the documents themselves? E.g. doc1 is of Topic0, doc5 is of Topic1, etc.
One way I can think of is to string-search each of the generated keywords of each topic in the docs. Is there a generic way or common practice for this?
Example LDA code: https://github.com/manhcompany/lda/blob/master/lda.py
I "collected some code", and this worked for me. Assuming you have a term frequency
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer()  # parameters of your choice
tf = tf_vectorizer.fit_transform(data)  # your data: a list of document strings
lda_model = LatentDirichletAllocation()  # other parameters of your choice, e.g. n_components
lda_model.fit(tf)
Create the document-topic matrix (the crucial step) and select the most important topics for each document:
doc_topic = lda_model.transform(tf)
num_most_important_topic = 2
dominant_topic = []
for ind_doc in range(doc_topic.shape[0]):
    dominant_topic.append(sorted(range(len(doc_topic[ind_doc])),
                                 key=lambda ind_top: doc_topic[ind_doc][ind_top],
                                 reverse=True)[:num_most_important_topic])
This should give you, for each document, the indices of its num_most_important_topic most important topics. Good luck!
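To make the doc-to-topic link explicit (doc1 is of Topic0, etc.), you can print each document's dominant topic together with that topic's top words. Here is a minimal sketch continuing from the variables above; the choice of 10 top words is an arbitrary assumption:
import numpy as np

# Use get_feature_names() instead on older scikit-learn versions.
feature_names = tf_vectorizer.get_feature_names_out()

for ind_doc, topics in enumerate(dominant_topic):
    top_topic = topics[0]
    # Top 10 words of the dominant topic, ranked by weight in lda_model.components_.
    top_words = [feature_names[i]
                 for i in np.argsort(lda_model.components_[top_topic])[::-1][:10]]
    print(f"doc{ind_doc} -> Topic{top_topic}: {', '.join(top_words)}")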

Scikit-learn: How to extract features from the text?

Assume I have an array of Strings:
['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']
I'd like to extract from this description features like:
item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...
Should I prepare the pre-defined known features first? Like
brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']
I am not sure that I need CountVectorizer or TfidfVectorizer here; DictVectorizer seems more appropriate, but how can I build dicts whose keys get their values extracted from the whole string?
Is it possible with scikit-learn's feature extraction? Or should I write my own .fit() and .transform() methods?
UPDATE:
@sergzach, please review whether I understood you right:
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...'...]
for d in data:
    for brand in brands:
        if brand in d:
            # ok brand is found
    for model in models:
        if model in d:
            # ok model is found
So, creating N loops, one per feature? This might work, but I am not sure it is right and flexible.
Yes, something like the following.
Excuse me, you should probably correct the code below.
import re

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...'...]

features = {
    'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
    'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc', r'core\s+m', r'intel\s+pentium', r'intel\s+core\s+duo']
    # and other features
}

cat_data = []  # your categories, which you should convert into numbers
not_found_columns = []

for line in data:
    line_cats = {}
    for col, patterns in features.items():
        found = False
        for i, pattern in enumerate(patterns):
            if re.findall(pattern, line.lower(), flags=re.UNICODE):
                line_cats[col] = i + 1  # found a numeric category in this column, e.g. 2 for dell, 5 for acer
                found = True
                break  # the current category is determined by the first occurrence
        if not found:
            # the loop ended without finding the feature; use the default value for a missing feature
            line_cats[col] = 0
            not_found_columns.append((col, line))
    cat_data.append(line_cats)

# now we have cat_data, where each column holds a categorical value (index + 1) if a feature was found, otherwise 0
Now you have, in not_found_columns, the column names together with the lines where no feature was found. Review them; you probably forgot some features.
We could also write strings (instead of numbers) as categories and then use DictVectorizer. The two approaches end up equivalent.
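For illustration, here is a minimal sketch of that string-category variant, feeding extracted dicts into scikit-learn's DictVectorizer; the example dicts are hand-written assumptions, not output of the code above:
from sklearn.feature_extraction import DictVectorizer

# Suppose the extraction step produced one dict of string categories per device description.
extracted = [
    {'brand': 'apple', 'cpu': 'core i7'},
    {'brand': 'dell', 'cpu': 'core i5'},
]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(extracted)    # one-hot encodes the string categories
print(dv.get_feature_names_out())  # e.g. ['brand=apple', 'brand=dell', 'cpu=core i5', 'cpu=core i7']
print(X)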
Scikit-learn's vectorizers will convert an array of strings into a document-term matrix (a 2-D array with a column for each term/word found). Each row (1st dimension) in the original array maps to a row in the output matrix. Each cell will hold a count or a weight, depending on which kind of vectorizer you use and its parameters.
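For instance, a quick sketch with CountVectorizer on descriptions like yours (the second description and the ngram_range choice are only illustrative assumptions):
from sklearn.feature_extraction.text import CountVectorizer

descriptions = [
    'Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS',
    'Laptop Dell Latitude, Core i5, 8Gb, 256Gb SSD, 14", Windows',  # made-up example
]

vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(descriptions)  # rows: descriptions, columns: terms
print(vectorizer.get_feature_names_out())   # use get_feature_names() on older scikit-learn
print(X.toarray())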
I am not sure this is what you need, based on your code. Could you tell us where you intend to use these features? Do you intend to train a classifier? For what purpose?
