Do "Consolidated Financial Data", "Form 8-K", and similar legal terms count as named entities (for Named Entity Recognition)? - named-entity-recognition

Thanks for helping. I am doing some named entity tagging and came across a few ambiguous terms.
I mostly follow the CoNLL-2003 annotation guidelines and the MUC-7 named entity definition (other annotation guidelines largely share the same ideas):
https://www.clips.uantwerpen.be/conll2003/ner/
https://www-nlpir.nist.gov/related_projects/muc/proceedings/ne_task.html
For the example sentence "I do business in North and South America", how should I tag "North and South America"? The whole phrase as one "Location" entity, or "North" and "South America" as two "Location" entities?
I am tagging some legal reports (the EDGAR dataset from the US Securities and Exchange Commission), so I encounter phrases like "Form 10" and "Form 8-K", which are types of legal forms for US companies. Should these be treated as "Miscellaneous" named entities?
These forms contain chapter names such as "Consolidated Financial Statements" and "Management's Discussion and Analysis". Should these be tagged as "Miscellaneous"?
If yes, then phrases like "Annual Report", "Annual Meeting for Shareholders", "Common Stock", and "Restricted Stock Unit" become quite ambiguous. It can be argued that they name a type of legal document or a type of financial instrument, but these phrases are generic and do not pinpoint one specific entity. Should they be "Miscellaneous" or "Outside of a named entity" (not a named entity)?
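For concreteness, the two candidate annotations of "North and South America" can be written out as token-level tags. This is only a sketch: the IOB2-style tags and the `spans` helper are chosen for illustration and are not taken from either guideline.

```python
# Two candidate token-level annotations of the same sentence.
tokens = ["I", "do", "business", "in", "North", "and", "South", "America"]

# Option A: the whole coordinated phrase is one LOC entity.
one_span = ["O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "I-LOC"]

# Option B: "North" and "South America" are two separate LOC entities.
two_spans = ["O", "O", "O", "O", "B-LOC", "O", "B-LOC", "I-LOC"]


def spans(tokens, tags):
    """Collect (label, text) pairs from an IOB2-style tag sequence."""
    found, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and label != tag[2:]
        )
        if starts_new:
            if current:
                found.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # an "O" tag closes any open entity
            if current:
                found.append((label, " ".join(current)))
            current, label = [], None
    if current:
        found.append((label, " ".join(current)))
    return found


print(spans(tokens, one_span))
print(spans(tokens, two_spans))
```

Whichever option is chosen, the same choice has to be applied to every coordinated name in the corpus so that the tagger sees consistent training data.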

Related

ER diagram help

If I have an entity such as "Job" with an attribute called "type" that specifies what type of job it is, how would I add a relationship from it to an entity such as "Recipient List", where the "Recipient List" is only applicable when "type" equals "delivery"? What would this look like?

distant supervision: how to connect named entities to freebase (KB) relations

I'm trying to create a distant supervision corpus. Thus far I've assembled the data, and passed it through an NER system, so you can see an example below.
Original data:
<p>
Myles Brand, the president of the National Collegiate Athletic Association, said in a telephone interview that he had not been approached about whether the N.C.A.A. might oversee a panel for the major bowl games similar to the one that chooses teams for the men's and women's basketball tournaments.
</p>
Processed with Stanford NER:
<p>
<PERSON>Myles Brand</PERSON>, the president of the <ORGANIZATION>National Collegiate Athletic Association</ORGANIZATION>, said in a telephone interview that he had not been approached about whether the <ORGANIZATION>N.C.A.A.</ORGANIZATION> might oversee a panel for the major bowl games similar to the one that chooses teams for the men's and women's basketball tournaments.
</p>
Now here is a sentence which contains the person Myles Brand and the organization National Collegiate Athletic Association.
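As a rough sketch of how the candidate entity pairs could be pulled out of tagged output like this, the inline tags can be parsed with a regular expression. The `entity_pairs` helper and the shortened example sentence are made up for illustration, and the tag names are simply the ones visible in the output above.

```python
import re
from itertools import product

# Matches inline tags such as <PERSON>...</PERSON>.
ENTITY = re.compile(r"<(PERSON|ORGANIZATION|LOCATION)>(.*?)</\1>")

# Shortened version of the tagged sentence, for illustration only.
TAGGED = (
    "<PERSON>Myles Brand</PERSON>, the president of the "
    "<ORGANIZATION>National Collegiate Athletic Association</ORGANIZATION>, "
    "said that the <ORGANIZATION>N.C.A.A.</ORGANIZATION> might oversee a panel."
)


def entity_pairs(tagged_sentence):
    """Return every (person, organization) pair that co-occurs in a sentence."""
    entities = ENTITY.findall(tagged_sentence)
    people = [text for label, text in entities if label == "PERSON"]
    orgs = [text for label, text in entities if label == "ORGANIZATION"]
    return list(product(people, orgs))


for person, org in entity_pairs(TAGGED):
    print(person, "|", org)
```

Note that the full name and the abbreviation come out as two separate organization strings, so they would still need to be resolved to the same knowledge-base entry.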
In Freebase we have these two entities sharing the relational bond of President as you can observe:
Freebase Relationship:
One would think the following query would do the trick, based on this question, but it doesn't, even though, as the picture above shows, Freebase seems to maintain the relationship between these two entities. Is there something I am doing wrong?
I've been playing around with it in here.
[{
"type" : "/type/link",
"source" : { "id" : "/en/myles_brand" },
"master_property" : null,
"target" : { "id" : "/en/national_collegiate_athletic_association" },
"target_value" : null
}]
Moreover, I have many thousands of entity pairs. I guess I could write a short Java program using the Freebase Java API to look up the relationships for all of them in turn. Does anyone have an example of such a program that I could take a peek at?
What I really want to know, though, is this: once I have the relationships, what is the best way to associate them with a distant supervision corpus? I'm confused about how it all looks once it has been fitted together.
You've got a couple of problems with the Freebase side of things. First, the relationship between Myles Brand and the NCAA isn't a direct one, but is mediated by a node representing his employment. This node has links to the employer, the employee, their title, the start date, and the end date. Second, the reflection queries have stronger directionality than the standard MQL queries and in this case Myles Brand is the target, not the source.
This query will show you the links to the /business/employment_tenure nodes:
[{
"type": "/type/link",
"source": {
"id": null
},
"master_property": null,
"target": {
"id": "/en/myles_brand"
}
}]
but it would need to be extended to deal with the multi-hop relationship that you're trying to find (and also extract the title).
Rather than doing this using reflection, you could test for the relationships directly if you've got a small enough set of them that you're interested in.
For example, you could test for an employment relationship (and fetch the title, if any) using:
[{
"/business/employment_tenure/person" : { "id" : "/en/myles_brand" },
"/business/employment_tenure/company" : { "id" : "/en/national_collegiate_athletic_association" },
"/business/employment_tenure/title": null
}]
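To run the direct employment test over many entity pairs, the query can be generated per pair. The following is a minimal sketch in Python rather than Java. It only builds the JSON and does not call any API, it assumes the pairs have already been resolved to Freebase ids, and the `label_examples` helper, the `NO_RELATION` label and the example relation name are hypothetical.

```python
import json


def employment_query(person_id, company_id):
    """Build the MQL employment-tenure test for one (person, company) pair."""
    return [{
        "/business/employment_tenure/person": {"id": person_id},
        "/business/employment_tenure/company": {"id": company_id},
        "/business/employment_tenure/title": None,  # serialized as null
    }]


def label_examples(mentions, kb_relations):
    """Distant supervision step: a sentence whose entity pair is related in
    the knowledge base becomes a training example for that relation."""
    examples = []
    for sentence, person_id, company_id in mentions:
        pair = (person_id, company_id)
        relation = kb_relations.get(pair, "NO_RELATION")  # hypothetical label
        examples.append(
            {"sentence": sentence, "pair": pair, "relation": relation}
        )
    return examples


pairs = [("/en/myles_brand", "/en/national_collegiate_athletic_association")]
queries = [json.dumps(employment_query(p, c)) for p, c in pairs]
print(queries[0])
```

The `kb_relations` dictionary stands in for whatever the queries return. Each sentence that mentions both entities of a related pair is then kept as a (noisy) positive example for that relation.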

What is the difference between Named Entity Recognition and Named Entity Extraction?

Please help me understand the difference between Named Entity Recognition and Named Entity Extraction.
Named Entity Recognition is recognition of the surface form of an Entity (person, place, organization), i.e. "George Bush" or "Barack Obama" are "PERSON" entities in this text string.
Entity Extraction will extract additional information as attributes from the text string. For example, in the sentence "George W. Bush was president before President Obama", it would recognize "Obama" as a person with the attribute "title=president".
But if you look at software the distinction is often blurred.
There is no such thing as Named Entity Extraction.
Put better, I would say that Named Entity Extraction is simply the process of concretely extracting previously recognized named entities. So, in a sense, no real theoretical knowledge is relevant to this task; it is just a matter of defining the mechanical operation.
If we are instead interested in extracting all the specific entities, or additional information about them, from a piece of text, then we have to look at information or knowledge extraction.
For information extraction you could for example ask to extract all the names of cities, or e-mail addresses, that appear in a corpus of documents. For such a task Named Entity Extraction could be used. You could even go much more generic, asking simply to extract general knowledge, for example in the form of relations (relation extraction).
For more details I would suggest the Natural Language Processing chapter of the book Artificial Intelligence: A Modern Approach.
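The distinction drawn in these answers can be sketched roughly as follows. The recognizer output here is written by hand (no NER library is called), and the e-mail pattern is a deliberately simple assumption, not a complete address grammar.

```python
import re

text = "George W. Bush was president before President Obama"


def span(text, surface, label):
    """Hand-made stand-in for a recognizer: locate a surface form, label it."""
    start = text.index(surface)
    return (start, start + len(surface), label)


# Recognition: mark where the entities are and what type they have.
recognized = [span(text, "George W. Bush", "PERSON"),
              span(text, "Obama", "PERSON")]

# "Extraction" in the mechanical sense: pull the recognized strings out.
extracted = [text[start:end] for start, end, _ in recognized]
print(extracted)

# Information extraction can also target other patterns, e.g. e-mail addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
corpus = "Contact alice@example.com or bob@example.org."
emails = EMAIL.findall(corpus)
print(emails)
```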

UISearchDisplayController/NSFetchedResultsController with custom sort order

I'm implementing a search feature in my app. I would like the user to look up a word simultaneously in multiple attributes of a given Entity.
Here is an example for an Entity with 3 String attributes: Person
(firstName, lastName, notes)
Let's use a mock dataset with 3 people:
"Emily", "Bridges", "She will be in town real soon."
"Johnny", "Williams", "This dude is really cool."
"Will", "Smith", "He does not remember anything for some reason."
Now, let's assume the user is looking up the occurrence "will" and that we run a case-insensitive search. All three previously described people will match the word "will" thanks to the use of an orPredicateWithSubpredicates.
Ideally I would like the results to be displayed in this order for relevancy purposes:
"Will", "Smith", "He does not remember anything for some reason."
"Johnny", "Williams", "This dude is really cool."
"Emily", "Bridges", "She will be in town real soon."
For this search feature, "firstName" is more relevant than "lastName", and both are more relevant than the "notes" attribute.
Since I'm using a UISearchDisplayController, I also use an NSFetchedResultsController, which requires an NSSortDescriptor. The problem for me now is: which attribute/key should I use to initialize the NSSortDescriptor?
I've been through many posts already and thought a transient property could help me with this issue, but I can't figure out how/when to set up this transient property which could be named something like "sortKey" and be set to these values:
1: For a match on "firstName"
2: For a match on "lastName"
3: For a match on "notes"
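The intended ranking can be sketched outside Core Data as a plain in-memory sort. This is Python, not the Core Data API, and the `sort_key` and `search` helpers are hypothetical names. It only illustrates how the 1/2/3 values would be computed and used.

```python
people = [
    ("Emily", "Bridges", "She will be in town real soon."),
    ("Johnny", "Williams", "This dude is really cool."),
    ("Will", "Smith", "He does not remember anything for some reason."),
]


def sort_key(person, term):
    """Return 1, 2 or 3 for a match on firstName, lastName or notes,
    or None when no attribute contains the term (case-insensitive)."""
    term = term.lower()
    for rank, value in enumerate(person, start=1):
        if term in value.lower():
            return rank
    return None


def search(people, term):
    """Keep the matching people, most relevant attribute first."""
    ranked = [(sort_key(person, term), person) for person in people]
    matches = [(rank, person) for rank, person in ranked if rank is not None]
    return [person for rank, person in sorted(matches, key=lambda m: m[0])]


for first_name, last_name, notes in search(people, "will"):
    print(first_name, last_name)
```

The key depends on the search term, so it has to be recomputed for every query rather than stored once per record.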
Eventually I guess I could try to run three different requests but then I'd have to give up using NSFetchedResultsController and all its magic...
I don't know whether I'm hitting the limits of NSFetchedResultsController or something but any pointer would be great, thanks!
Joss.

Rails best practices - How do you handle unique users that may have identical records?

How do you handle real name conflicts? Is there an established best practice or UI design pattern for disambiguating records like this? If authors can have many articles but more than one author can possibly have the same name how would you enable users to select the author they actually want when creating articles?
I can't dictate the author names be unique. The authors may have some other information that could individuate them (their articles or other optional fields).
To make this clearer - users are not authors. Users are people entering information about authors and articles. The only guaranteed information present for an author is the author's name. Other details are optional.
So if a user is creating a new record for an article they will have to either select or create an author for the many-to-many relationship between authors & articles.
With unambiguous Rails examples such as the blog post category dropdown, like the one Ryan Bates uses in his Railscasts, it is easy to create or update: if the category exists, link the blog post to it; if it doesn't, create it and link the blog post to it.
My case is much messier. "If it exists" isn't that meaningful, but I don't want to create a separate author entry for every article the author writes.
Presumably you have a key that means you know which user authored which records, so it comes down to how you can best disambiguate them for your users.
Perhaps you need to ask your authors for a brief summary of themselves in their profile that you can use to disambiguate them on their terms. Alternatively depending on the type of article you might choose to describe them in terms of geography ("John Biggs, Florida", "John Biggs, California" ) or perhaps by the subject areas they choose to write about: "John Biggs, Java Expert", "John Biggs, Indonesia Specialist" and so on.
You could even just have "John Biggs (1)", "John Biggs (2)" and so on. I seem to recall this works alright for IMDB, who are a good example of a site that has had to sort this problem.
The important thing for usability here is consistency: you need to always identify your authors in the same way, so that you don't have both "John Biggs, Florida" and "John Biggs (2)". You also need to make sure that the identity you give an author doesn't change once it is set up, so "John Biggs (2)" never becomes "John Biggs (5)" and your users can recognize the disambiguated name as the same person who had that name previously.
One thing that worked for me on a past project is to have a text box in which users can type in the author's name. As they type, I update a div with possible matches - similar to Stack Overflow when you type a tag in the ignored or interesting box.
Users can click on a name in the div which opens the record in a new window - new window has a button, "select this author," which takes you back to the original page with that author in the textfield as Author Name (id).
If they submit the form with an ambiguous name, we have an extra step where we display matches, and they choose which one they mean.
I imagine you'd want something a little more streamlined if this is a data-entry type application, but on that project adding an author was an infrequent operation.
Several things to think about:
Can you filter by subject matter first?
For instance, if John Jones (1) writes articles about genetics and John Jones (2) writes articles about computer networking, then by having the user select the general subject matter first, you may be able to filter out many of the less applicable possible duplicate names.
(I would, however, have a button to see the unfiltered list, because sometimes people write articles in a new subject area.) If you don't want to limit the choices, perhaps a sort by subject matter or location could make it easier to find the right one.
When you show the list of possible duplicate names, show general information about the author, including address and university affiliation and possibly the name of one article. Have a button to click on to show existing articles for any one of them. That way, if you know the John Jones you want is located in FL, you only need to check out the three in FL for articles, not all 37 John Joneses who wrote genetics articles.
Be aware that users are often lazy: they would rather just insert a new name than choose from a long list of existing names. So make it harder to insert a new name than to pick one. They should have to go through the pick process first before they can enter a new name. We have an application which doesn't even show the button to add a new person until after you have done a search. Since names can have variations, consider whether you want to use fuzzy matching for your search. You might want to display J. Jones, Johnny Jones and Jon Jones as well as John Jones in your pick results.
Now a lot of this depends on how much knowledge your users have about the author ahead of time. If they know nothing beyond the name, they have no basis to judge between the 37 John Jones you have in the database. In this case it might be better just to accept the duplicates and return results based on a filtering by keywords or whatever you are storing about the article. Is it really necessary to make sure that the articles are ascribed to the correct John Jones, if you really know nothing about the author other than his name? Are you more concerned with the subject matter and name of the article or with having a list of all articles written by John Jones from UVA who is a professor of Political Science?
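As a rough illustration of matching on name variations, here is a sketch using Python's standard `difflib` module rather than anything Rails-specific. The `candidate_names` helper, the cutoff of 0.6 and the sample names are assumptions to tune, not recommendations.

```python
import difflib


def candidate_names(query, known_names, limit=5, cutoff=0.6):
    """Return known author names that look similar to the typed name,
    best match first. Comparison is case-insensitive; names that differ
    only by case collapse into one entry."""
    lowered = {name.lower(): name for name in known_names}
    hits = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[hit] for hit in hits]


known = ["J. Jones", "Johnny Jones", "Jon Jones", "John Jones", "Mary Smith"]
for name in candidate_names("John Jones", known):
    print(name)
```

A character-based similarity like this catches spelling variants, but it knows nothing about nicknames or initials as such, so very short forms may still fall below the cutoff.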
You don't! Names are a bad method of identification as you're finding out. You have a number of methods around this:
Add some form of unique identifier. With normal users this would be a username that you check for uniqueness. In your case, the "name (1)" method described above might have to do, if you really have no information other than the name.
An alternative would be to use multiple attributes to make a composite key (e.g. name + dob).
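The two ideas can be combined in a small sketch: a composite (name, date of birth) key decides whether an author already exists, and a numbered label that never changes is assigned on creation. This is plain Python, not ActiveRecord, and the `AuthorRegistry` class is hypothetical.

```python
from datetime import date


class AuthorRegistry:
    """In-memory stand-in for an authors table with a composite key."""

    def __init__(self):
        self._by_key = {}   # (normalized name, dob) -> author id
        self._labels = {}   # author id -> display label, fixed at creation
        self._counts = {}   # normalized name -> authors sharing that name

    def find_or_create(self, name, dob=None):
        # Caveat: authors with the same name and no dob share one key.
        key = (name.strip().lower(), dob)
        if key in self._by_key:
            return self._by_key[key]
        author_id = len(self._labels) + 1
        number = self._counts.get(key[0], 0) + 1
        self._counts[key[0]] = number
        self._by_key[key] = author_id
        self._labels[author_id] = f"{name.strip()} ({number})"
        return author_id

    def label(self, author_id):
        return self._labels[author_id]


registry = AuthorRegistry()
first = registry.find_or_create("John Biggs", date(1970, 1, 1))
second = registry.find_or_create("John Biggs", date(1982, 5, 9))
print(registry.label(first), "/", registry.label(second))
```

Because the label is stored when the author is created, later insertions cannot renumber an existing author.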
