Parsing name and address from unstructured text

Parsing name and address from unstructured text - parsing

I am working on an application that requires me to parse unstructured
text. I need to parse name, address - area,city,country and zip code
from it. The addresses will be Indian.
Sample input:
"I am ABC working in XYZ company.
I am good at web designing having an experience of 3 years.
I live in kothrud,Pune-411038,Maharashtra."
Output:
NAME : ABC
AREA : KOTHRUD
CITY : PUNE
STATE : MAHARASHTRA
ZIP CODE : 411038
I am planning to use Apache ConceptMapper for parsing cities and states
for which I will have to build a dictionary set myself, but I guess that
can be done. For the zip code, I can use regex. I am stuck at how to
parse a name and area. Regex can be used to get name and area with
little hacking and lots of patterns but I am wondering if there is any
better solution available.
Is there any database I can query to, that would return addresses? I
haven't looked into Google maps/places but can you achieve address
parsing with them easily?
Any inputs would be highly appreciated.
Thanks.

The Google Geocoding API can help with this. It will return the map coordinates for a given address or an appropriate status code if no match is found.

Related

Convert ID from speech to text without spaces

I'm using Google Cloud Speech API with IBM Voice Gateway in order to interact with a VoiceBot through a phone.
If I say an identifier contening letters and numbers through the phone, the Google Cloud Speech converts it into string with spaces. For example, if I say "A1B2C3", it will convert it into the following string "a 1 b 2 c 3".
Do you know if there is way to avoid these useless spaces ?
Thanks for your help!
Lucas

I don't see any way in which you can eliminate spaces from the API response. What you could do is experiment with the available features, as this is probably your best chance to get a recognition more similar to what you are looking for.
For example: you can provide some sample hint phrases echoing your use case, indicate that the audio is a phone call, or use an enhanced model (although for the latter to be available you need to first opt in for data logging).
Honestly though, for your case, it might be better if you post process the returned string (e.g. with a simple "a 1 b 2 c 3".replace(' ','') ).

Coreference Resolution by CoreNLP CorefChainAnnotation.class not working

I am using core nlp library to find coreference in my text
Tyson lives in New York City with his wife and their two children.
when I am running this on Stanford CoreNLP Online demo it's giving me correct output
but when I run this text on my machine it's returning null on this line of code
Map graph = document.get(CorefChainAnnotation.class);
Thank you

Look into this complete example - http://blog.pengyifan.com/resolve-coreference-using-stanford-corenlp/. I guess you are missing something as i am unable to understand the exact reason from the code you provided.

Map bpmn to wsdl

My task is to take a bpmn 2.0 xml file and map it as good as possible (with a certain error rate) to available web services. For example when my bpmn file explains the process of buying a pizza, i give 10€ and get back 1 pizza. Now it should map that bpmn to the webservice that needs an of type int with the name "money" etc.
How is that even possible? I searched for a few hours now and came up with the following:
I found https://github.com/camunda/camunda-bpm-platform and can easily use it to parse a plain .bpmn file to a java object structure which i can then query. Easy.
After parsing the xml notation i should analyze it and search for elements that input data and elements that output data for this are the only things i can map to wsdl (wsdl only describes the structure of the webservice: names of variables, types of variables, number of variables). Problem: I do not find any 1:1 elements i can easily declare as "when this bpmn element is used, it 100% means that the process is getting some input named x". What should i do here? What can i map?
I found ws-bpel. As far as i understand i can somehow transfer bpmn to ws-bpel which should be better modeling of the process and more easily be mappable to a wsdl (?). Camunda however doesn't offer this functionality and i am restricted to open source software.
Any suggestions what i should do?

localising postal / physical address display from database fields

Can anyone point me to a list of international postal / residential / delivery address format templates that use some kind of parseable standard vocabulary for address parts?
The ideal list contains a country code then a format using replaceable tokens so I can substitute database address fields into a template to produce something printable in the local format.
for example
NZ | [first_name] [family_name]\n[company_name]\n[street_address]\n[city] [post_code]\n[country]
AU | [first_name] [family_name]\n[company_name]\n[street_address]\n[city]\n[state] [post_code]\n[country]
US | etc
UK | etc
Background: I used to have a simple freetext field to accept addresses. Moving to support vCard download, which requires addresses to be broken down into specific fields. Thats all fine: we can do the migration. I'm looking for a way to display the fields in the "correct" order for each country. thanks for your help!

This MSDN page has the information in the format you need and seems accurate, but covers only 33 countries. Maybe they are enough.
The Universal Postal Union offers all the information you need for a lot of countries here. This is top quality information; however, it is split across as many PDF documents as there are countries and is not in the format you need.
This page provides the information in a slightly more accessible form. As far as I can judge, it is accurate (and contains a lot of valuable info), but I can't speak to its quality nor its currentness.

Google have a JSON-based API that they use for their Android address input field library that contains this kind of formatting information.
The field you'd be interested in is fmt. There doesn't seem to be any formal documentation on the format they use, but a proposal to include this information as part of the Unicode CLDR has matching fields (scroll down to "Detailed Breakdown of elements"); there are also some clues in Google's libaddressinput source code.

Parsing a full name into its constituents

We are in need of developing a back end application that can parse a full name into
Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc
Challenge here is that it has to support names of multiple countries and languages. One assumption that we have is we will always get a country and language along with the full name as input.
The full name may come in any format. For the same country / language combination, it may come in with first name last name or the reverse. Comma will not be a part of the Full Name.
Is is feasible? We are also open to any commercially available software.

I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".
Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.

Since the OP was open to any commercially available offering...
The "IBM InfoSphere Global Name Analytics" appears to be a commercial solution satisfying the original request for the parsing of a [free-form unstructured] personal name [full name]; apparently with a degree of certainty in regards to resolving some of the name ambiguity issues alluded to in other responses.Note: I have no personal experience nor association with the product, I had merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.
A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html
Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html
The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html
Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html
"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html
To clarify why I think this answers the question, ameliorating some of the previous alluded difficulties in accomplishing the task... If I understood correctly what I read, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)" which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize , and match names."
FWiW I also ran across a "Name Parser" which uses a database of ~140K names as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm

The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.

Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.
It is much preferable to treat a name as an "atomic" not-splittable entity.

Here are two free PHP name parsing libraries for those on a budget:
https://code.google.com/p/php-name-parser/
http://jasonpriem.org/human-name-parse/
And here is a Javasript library in Node package manager:
https://npmjs.org/package/name-parser

I wrote a simple human name parser in javascript as an npm module:
https://www.npmjs.org/package/humanparser
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');
var fullName = 'Mr. William R. Jenkins, III'
, attrs = human.parseName(fullName);
console.log(attrs);
//produces the following output
{ saluation: 'Mr.',
firstName: 'William',
suffix: 'III',
lastName: 'Jenkins',
middleName: 'R.',
fullName: 'Mr. William R. Jenkins, III' }

A basic algorithm could do the following:
First see if incoming string starts with a title such as Mrs and remove it if it does, checking against a fixed list of titles.
If there is one space left and one space exactly, assume first word is first name and second word is surname (which will be incorrect at times)
To go beyond that would be lots of work, see How to parse full names to identify avenues for improvement and see these involved IBM docs for further implementation clues

"Ashton Jordan" "Jordan Ashton" -- u can't tell which is the surname and which is the give name.
Also people in South India apparently don't have a surname. The same with Sherpas in the Himalayas.
But say you have a huge list of all surnames (which are never used as given names) then maybe you can use that to identify other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...) And if there is ambiguity your name-parser could ask for human input.

As others have explained, the problem is not solvable. The best approach I can think of to storing names is storing the full name, followed by the start (and potentially also ending) offsets into a "primary collating subfield" which the person entering the name could have indicated by highlighting it or such. For example
John Robert Miller, Jr.
where the boldface is indicating what was marked as the "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
Of course this approach alone may not be sufficient if you also want to support titles (and ignoring them for collation purposes)...

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Parsing name and address from unstructured text - parsing

The Google Geocoding API can help with this. It will return the map coordinates for a given address or an appropriate status code if no match is found.

Related

Convert ID from speech to text without spaces

Coreference Resolution by CoreNLP CorefChainAnnotation.class not working

Map bpmn to wsdl

localising postal / physical address display from database fields

Parsing a full name into its constituents

Categories

Resources