iOS - Check pasteboard for valid Mailing Address

I am looking for some guidance for how I could check the pasteboard in iOS for a valid mailing address.
If someone pastes
1234 Apple Street
New York, NY 10011
I want it to parse each part of the string to fill in Address, City, State and Zip. It could be any address, and ideally it could also be found inside a longer string.
For example
Meet me at 1234 Apple Street New York, NY 10011 See you there!
It should still parse the correct Address, City, State and Zip.
Any help would be much appreciated!
-Wes

I was a developer at SmartyStreets. We were kind of crazy about street addresses, and street addresses drove me crazy (especially parsing them). It's a two-way street. (Am I done with the street puns?)
First, let's talk about the case where the address is all by itself, because that's easier, albeit still difficult...
Please refer to this other question and answer about the very same thing. I also strongly encourage you to follow the links to related questions in both the question and the answer. Parsing addresses is a can of worms, but it's not impossible. It's just really hard to do it reliably.
Notice in the answer to that question how many different formats valid addresses can appear in. What guarantees do you have that the user will type it in any of those? And that's only a few. There are others. Consider military, PO box, rural route, and other "special" addresses that don't adhere to the typical format. What about addresses that have a two-or-three-word city name? What about addresses that use a grid system like 100 N 500 E, or secondary numbers like suite, apartment, floor, etc? What about addresses with "1/2", hyphens (as a required punctuation), etc? Addresses missing zip codes or city/state?
All of these and more could be valid. And that's only for US addresses.
If all your addresses, or even most of them (which isn't the case), came in the form you proposed above, for example:
[Primary Number] [Street Name] [Any of these street suffixes]
[City Name Followed by a Comma], [State Abbreviation] [5-digit ZIP code]
Then this would be quite easy. Wouldn't that be nice?
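For that idealized two-line layout only, a minimal Python sketch could look like the following. The suffix list, the parse_simple_address name and the returned keys are assumptions made for illustration. The function returns None for anything outside that layout (grid addresses, PO boxes, missing ZIP codes and so on), which is exactly the limitation discussed here.

```python
import re

# Hypothetical, deliberately short suffix list; real data has many more.
SUFFIXES = r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln)"

# [Primary Number] [Street Name] [Suffix]
STREET_LINE = re.compile(
    rf"^(?P<number>\d+)\s+(?P<name>.+?)\s+(?P<suffix>{SUFFIXES})\.?$",
    re.IGNORECASE,
)
# [City Name], [State Abbreviation] [5-digit ZIP code]
LAST_LINE = re.compile(r"^(?P<city>[^,]+),\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})$")


def parse_simple_address(text):
    """Parse the idealized two-line format; return None for anything else."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) != 2:
        return None
    street = STREET_LINE.match(lines[0])
    last = LAST_LINE.match(lines[1])
    if not street or not last:
        return None
    return {
        "address": lines[0],
        "city": last["city"].strip(),
        "state": last["state"].upper(),
        "zip": last["zip"],
    }


print(parse_simple_address("1234 Apple Street\nNew York, NY 10011"))
print(parse_simple_address("100 N 500 E\nProvo, UT 84606"))  # grid address: no match
```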
You could try to write a regular expression like this guy or that guy, but that only works if addresses are a regular language. They're not regular, and regular expressions are not the answer.
There are a few services which can do this for you because they have a master list (kind of), and the software has to meet rigorous certification standards.
Obviously, since I work at SmartyStreets, I'm prone to suggest starting your search for an answer there. You can try some freeform addresses on the homepage (just fill out the "Street" field). But be aware of a few things that will probably always be an issue. LiveAddress API will be able to parse street addresses for you, most of the time. Shop around, but this should give you an idea.
Now your second question: extract a street address from a string of text. This has been extensively covered elsewhere on S.O. and the interwebs, so I won't go into a lot of detail. Basically, to do this reliably, you'll probably need some Natural Language Processing and human interaction to confirm or correct the best guess.
Don't ever assume these things about un-standardized addresses:
Starts with a number
Ends with a number
Everything between the two numbers is an address
Has a ZIP code present
No more than 2 numbers will be in an address
It's unambiguous
It exists
A street suffix will always be present
It's spelled correctly
...etc.
Again, refer to some other linked posts about this issue. You can make guesses, but always always always have a human confirm the guess if you do that. (Some Mac apps do this. If they detect an address, it will get highlighted, and you can add that address to your contacts. Unfortunately I've seen false positives a lot, and it also misses them a lot.)
Good luck!

I also work at SmartyStreets, and since I'm not a developer I'm not bound by any constraints such as "it can't be done" or "there's no way to do it reliably". In fact the ideas that I come up with may not even always be possible, but, I'm a problem-solver, a solution-finder, and this particular problem absolutely has a solution.
You'll need the following: a little regex, knowledge of a scripting language (python, php, whatever you prefer) and access to an address validation tool (this is required so that you know when you get it right).
So, let's start with the example sentence:
Meet me at 1234 Apple Street New York, NY 10011 See you there!
We can be sure that every address has a beginning and an end. (you can take that to the bank!)
So, if you run a regular expression that looks for the beginning of the address within the string you can eliminate everything before the address begins. Here's a regex that will do just that:
(^(.*(?=p\.?o\.? box|h\.?c\.?r\.? |c\.?m\.?r\.?)|^[^0-9]+))
This will give you back the following:
1234 Apple Street New York, NY 10011 See you there!
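As a rough Python sketch of this step, the regex above can be applied with re.sub to drop whatever precedes the address. The strip_leading_text name and the use of re.IGNORECASE (so that "PO Box" matches as well as "po box") are assumptions, not part of the original pattern.

```python
import re

# Pattern quoted above: matches everything before a PO Box / HCR / CMR marker,
# or else the leading run of non-digit characters.
LEADING_TEXT = re.compile(
    r"(^(.*(?=p\.?o\.? box|h\.?c\.?r\.? |c\.?m\.?r\.?)|^[^0-9]+))",
    re.IGNORECASE,  # assumption: treat "PO Box" and "po box" alike
)


def strip_leading_text(text):
    """Remove the text that comes before the likely start of an address."""
    return LEADING_TEXT.sub("", text, count=1)


print(strip_leading_text("Meet me at 1234 Apple Street New York, NY 10011 See you there!"))
```

Note that the pattern cuts at the first digit it sees, so a sentence with an earlier number (a time or a room number, say) would start the candidate string too early.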
Now you're halfway there, but you'll need to loop through the remaining string. Another assumption that you can certainly make is that an address will never be longer than 328 characters (I made up that number, but you get the picture). An address has to have an end as well, and you can shorten the string by determining the maximum acceptable USPS address length.
You're going to loop through the address string until you get a valid address out of it. To do this, start at the beginning and move one word to the right with each additional permutation. This is where the address validation service comes in handy, because you have no idea where the address ends and that's what you need to know. So, each permutation you generate from the string (remember, you're starting from the left side) will be sent for validation. Since no valid address can have fewer than two words, you'll start there. Here are the permutations from the example address as well as the validation results (I'm trying each address by entering it in the address line of the address search box on smartystreets.com):
1234 Apple ==> fail
1234 Apple Street ==> fail
1234 Apple Street New ==> fail
1234 Apple Street New York ==> fail
1234 Apple Street New York, NY ==> Bingo, valid address match. No need to keep iterating.
Now, obviously this is not a real address, but you can try the same thing with a real address and you'll get the same results. This certainly isn't the most sophisticated method to extract a valid address from a string, but it works. And, since SmartyStreets allows you to send up to 100 addresses per query, you could permute the address string up to 99 times and get the results back in under 300ms. This won't work with every address, as you'll certainly find out, but it can very easily handle a large majority of them, regardless of how obscured the address is within the text string.
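The loop described above might be sketched in Python as follows. The is_valid argument is a hypothetical callable standing in for whatever address validation service you use; no real API is called here, and fake_validator exists only to make the sketch runnable. The 328-character cap is the made-up figure from the text.

```python
def candidate_prefixes(text, min_words=2, max_chars=328):
    """Yield growing word prefixes of text, shortest first.

    max_chars is the arbitrary cap mentioned above, not a USPS limit.
    """
    words = text[:max_chars].split()
    for count in range(min_words, len(words) + 1):
        yield " ".join(words[:count])


def find_address(text, is_valid):
    """Return the first prefix that the validator accepts, or None."""
    for candidate in candidate_prefixes(text):
        if is_valid(candidate):
            return candidate
    return None


# Stand-in validator for demonstration only; a real one would call a service.
def fake_validator(candidate):
    return candidate.endswith(", NY")


remaining = "1234 Apple Street New York, NY 10011 See you there!"
print(find_address(remaining, fake_validator))
```

Instead of validating one candidate at a time, the prefixes could also be collected into a list and submitted as a single batch, if the service you use supports that.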
So, we started with this meet me at 1234 Apple Street New York, NY 10011 See you there! and within less than half a second came up with this 1234 Apple Street New York, NY 10011-1000.
Pretty cool huh? It even sounds really easy coming from a non-programmer.
Let's try it with a real address:
Meet me at 4219 jon young orlando fl 32839 See you there!
Apply regex and you get:
4219 jon young orlando fl 32839 See you there!
Permute, iterate, validate:
4219 jon ==> fail
4219 jon young ==> fail
4219 jon young orlando ==> fail
4219 jon young orlando fl ==> Bingo, valid address match.

Related

Twilio - is there a place to find out formatting requirements?

I have been all through the Twilio API docs, and I can't seem to find what requirements they have for certain inputs. For instance, I found that if I am searching for a number in a city, the city can't have any punctuation in it, but sometimes the abbreviated name is required. For instance, searching for in_locality="Ft. Worth" and in_locality="Ft Worth" won't work, but in_locality="Fort Worth" does. Oddly though, other abbreviations are somewhat required, like in_locality="St George" works, but in_locality="Saint George" does not, nor does in_locality="St. George"
Are there any rules for this written down anywhere, or do I just have to figure out every permutation of an abbreviated city name by magic, and try them all?

Parsing six figure Ordnance Survey Grid Reference input

An extensive search has produced no answer to the question, "Is there a class or function that parses input for soundness relating to UK Ordnance Survey Grid References".
The UK is mapped by the UK Ordnance Survey, who produce detailed maps of the United Kingdom with many types of referencing. One of these is the commonly used six figure Grid Reference; we are at SO896804.
We already use a postcode (zip) checker to make sure that the information entered into the postcode field is sound, but we can't find the same for the OS Grid Reference.
Does such a Grid Reference function exist, or do we down tools and write one?
Thank you.
Since you tagged this "parsing", rather than using some GIS-like tag, I'd say that a reasonably valid OS grid reference corresponds to the regular expression:
(H[PTUWXYZ]|N[ABCDFGHJKLMNORQSTUWXYZ]|OV|S[CDEHJKMNOPRSTUVWXYZ]|T[AFGLMQRV])([0-9]{6})
If you were prepared to accept 4-digit 100ha blocks as well as the six-digit 1ha blocks, you could replace the second parenthesized expression with ([0-9]{6}|[0-9]{4}).
Of course, some of the two-letter blocks are almost completely marine. You could almost certainly ignore OV, for example. NW contains a little bit of Dumfries & Galloway (Portpatrick, NW995545), and a much larger but irrelevant part of Northern Ireland.
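A small Python sketch of a checker built on that expression might look like this. The is_grid_reference name, the allow_four_figure flag, and the stripping of spaces and upper-casing before matching are assumptions added for convenience. The relaxed form is taken to mean six or four digits.

```python
import re

# Two-letter 100 km squares from the expression above.
SQUARES = (
    r"(H[PTUWXYZ]|N[ABCDFGHJKLMNORQSTUWXYZ]|OV"
    r"|S[CDEHJKMNOPRSTUVWXYZ]|T[AFGLMQRV])"
)
SIX_FIGURE = re.compile(SQUARES + r"([0-9]{6})")
SIX_OR_FOUR_FIGURE = re.compile(SQUARES + r"([0-9]{6}|[0-9]{4})")


def is_grid_reference(text, allow_four_figure=False):
    """Check the shape of an OS grid reference; it does not check that
    the location is on land or otherwise meaningful."""
    pattern = SIX_OR_FOUR_FIGURE if allow_four_figure else SIX_FIGURE
    cleaned = text.replace(" ", "").upper()
    return pattern.fullmatch(cleaned) is not None


print(is_grid_reference("SO896804"))
print(is_grid_reference("SO8980", allow_four_figure=True))
print(is_grid_reference("ZZ123456"))
```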

Find location from text

I am currently thinking of how to find a location from a text, such as a blogpost, without the user having to input any additional information. For example a post could look like this:
"Aberdeen, With a Foot on the Seafloor
Since the early 1970s, Aberdeen, Scotland, has evolved from a gritty fishing town into the world’s center of innovation in technology for the offshore energy industry."
By reading it I realize that the post is about Aberdeen Scotland but how can I geotag it? I have been using the geocoder (https://github.com/alexreisner/geocoder) by Alex Reisner but it seems weird to check every word against the google/nominatim(osm). My initial idea was to simply bruteforce it by checking every word with the geocoder and try to see if there are similarities between the words. But it seems like there could be a better way around this.
Has anyone done anything similar to this? Any algorithm that could be suggested (or gem :) ) would be immensely appreciated!
I'm sure there have been projects dedicated to this - for example, google's uncanny ability to geotag and pick data out of your personal emails effortlessly.
The most obvious answer I can see here, would be to create a few regular expressions for locations. The most simple one would be for City, Country:
Regexp.new("((?:[a-z][a-z]+))(.)(\\s+)((?:[a-z][a-z]+))",Regexp::IGNORECASE);
This would recognize Aberdeen, Scotland, but also course, I or even thanks, bye. It would be a start though, to query only those recognized spots instead of every word in the document.
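To illustrate the "query only the recognized spots" idea, here is a Python sketch using a tightened variant of that pattern. Requiring a literal comma and capitalized words is an assumption on my part, not the expression above, and location_candidates is a made-up name. The pairs it returns are only candidates that you would then pass to the geocoder.

```python
import re

# Assumed variant: Capitalized word, comma, whitespace, Capitalized word.
CITY_COUNTRY = re.compile(r"\b([A-Z][a-z]+),\s+([A-Z][a-z]+)\b")


def location_candidates(text):
    """Return (first, second) word pairs that look like 'City, Country'."""
    return [match.groups() for match in CITY_COUNTRY.finditer(text)]


post = (
    "Aberdeen, With a Foot on the Seafloor\n"
    "Since the early 1970s, Aberdeen, Scotland, has evolved "
    "from a gritty fishing town."
)
for city, country in location_candidates(post):
    print(city, country)
```

On this sample the sketch yields both ("Aberdeen", "With") and ("Aberdeen", "Scotland"), so false positives remain and the geocoder still has to do the real filtering.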
There are also widely known regular expressions for addresses, cities, etc. You could use those as well if you find your algorithm missing matches.
Cheers!

How exactly do address lines work in QuickBooks?

Right now I'm only trying to read addresses and display them. Ignoring IPP right now, just inside QB, I'm not understanding the algorithm that manages the address lines.
Further, when accessing the customer address object via IPP, there are more differences, adding to my confusion. I'll call the three areas I'm looking at the freeform block, field block and IPP object. Here's an example where I typed the text into the field block and made the text match the field name:
The freeform block and IPP object took the City, State and Zip values and combined them into line 3. The IPP object has the Note value in Line 4. And the Country value ends up in the City field in IPP and field block.
Here's an example where I simply typed "line 1 ... line 5" in the freeform block:
Lines 1 - 4 look ok in the field block after the conversion, and put "line 5" into the City field. The IPP object is missing Line 4 field and value altogether.
Can someone share with us how this works? I'm trying to read these addresses and display them in my app in a consistent way.
I'm not familiar with Quickbooks. But I think you're looking for "address standardization" since you aren't sure in what format the address will come from Quickbooks.
Addresses are tricky (trust me, I work at SmartyStreets, where we have to be smart ... about streets) but there are services -- free and paid -- which will standardize addresses and put them into a consistent "componentized" format.
Take a look at LiveAddress API for starters... or you could use the batch/list service if you export your data into a file. Either way, it's free to use for a certain number of addresses.
(Tip: You can submit addresses for standardization and verification in two fields: "Street" and "Last Line" and still get good results -- so if you're not exactly sure where the city/state are, just put anything that's not the street address in the last line field.)

Parsing a full name into its constituents

We are in need of developing a back end application that can parse a full name into
Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc
Challenge here is that it has to support names of multiple countries and languages. One assumption that we have is we will always get a country and language along with the full name as input.
The full name may come in any format. For the same country / language combination, it may come in with first name last name or the reverse. Comma will not be a part of the Full Name.
Is it feasible? We are also open to any commercially available software.
I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".
Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.
Since the OP was open to any commercially available offering...
The "IBM InfoSphere Global Name Analytics" appears to be a commercial solution satisfying the original request for the parsing of a [free-form unstructured] personal name [full name]; apparently with a degree of certainty in regards to resolving some of the name ambiguity issues alluded to in other responses.
Note: I have no personal experience nor association with the product, I had merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.
A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html
Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html
The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html
Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html
"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html
To clarify why I think this answers the question, ameliorating some of the previous alluded difficulties in accomplishing the task... If I understood correctly what I read, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)" which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize , and match names."
FWiW I also ran across a "Name Parser" which uses a database of ~140K names as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm
The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.
Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.
It is much preferable to treat a name as an "atomic" not-splittable entity.
Here are two free PHP name parsing libraries for those on a budget:
https://code.google.com/p/php-name-parser/
http://jasonpriem.org/human-name-parse/
And here is a JavaScript library available through npm (Node package manager):
https://npmjs.org/package/name-parser
I wrote a simple human name parser in javascript as an npm module:
https://www.npmjs.org/package/humanparser
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');
var fullName = 'Mr. William R. Jenkins, III';
var attrs = human.parseName(fullName);
console.log(attrs);
//produces the following output
{ saluation: 'Mr.',
firstName: 'William',
suffix: 'III',
lastName: 'Jenkins',
middleName: 'R.',
fullName: 'Mr. William R. Jenkins, III' }
A basic algorithm could do the following:
First see if incoming string starts with a title such as Mrs and remove it if it does, checking against a fixed list of titles.
If exactly one space remains, assume the first word is the first name and the second word is the surname (which will be incorrect at times).
Going beyond that would be a lot of work. See How to parse full names to identify avenues for improvement, and see these involved IBM docs for further implementation clues.
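The basic algorithm described above could be sketched in Python like this. The TITLES set, the parse_basic_name name and the returned keys are illustrative assumptions. Anything that is not exactly two words after the title is reported as unparsed rather than guessed at.

```python
# Hypothetical fixed list of titles, compared without a trailing period.
TITLES = {"mr", "mrs", "ms", "miss", "dr", "prof"}


def parse_basic_name(full_name):
    """Strip a known title, then split exactly two words into first/last.

    Returns None when the remainder is not exactly two words.
    """
    words = full_name.split()
    title = None
    if words and words[0].rstrip(".").lower() in TITLES:
        title = words.pop(0)
    if len(words) != 2:
        return None
    # This assumption will be wrong for family-name-first conventions.
    return {"title": title, "first_name": words[0], "last_name": words[1]}


print(parse_basic_name("Mrs Jane Smith"))
print(parse_basic_name("Ralph Vaughan Williams"))  # three words: not handled
```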
"Ashton Jordan" "Jordan Ashton" -- u can't tell which is the surname and which is the give name.
Also people in South India apparently don't have a surname. The same with Sherpas in the Himalayas.
But say you have a huge list of all surnames (which are never used as given names) then maybe you can use that to identify other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...) And if there is ambiguity your name-parser could ask for human input.
As others have explained, the problem is not solvable. The best approach I can think of to storing names is storing the full name, followed by the start (and potentially also ending) offsets into a "primary collating subfield" which the person entering the name could have indicated by highlighting it or such. For example
John Robert Miller, Jr.
where the boldface is indicating what was marked as the "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
Of course this approach alone may not be sufficient if you also want to support titles (and ignore them for collation purposes)...
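A minimal Python sketch of the offset idea follows, assuming the marked range is stored as start and end character offsets into the full name. The collating_key name, the whitespace normalization and the case folding are my own assumptions. The example marks "Miller" purely for illustration.

```python
def collating_key(full_name, start, end):
    """Build a sort key by moving the marked range to the front."""
    primary = full_name[start:end]
    remainder = full_name[:start] + full_name[end:]
    # Collapse whitespace and fold case so the key sorts predictably.
    return " ".join((primary + " " + remainder).split()).casefold()


def marked(full_name, subfield):
    """Helper for the demo: compute offsets from a substring."""
    start = full_name.index(subfield)
    return full_name, start, start + len(subfield)


people = [
    marked("Ralph Vaughan Williams", "Vaughan Williams"),
    marked("John Robert Miller, Jr.", "Miller"),
    marked("Charles Villiers Stanford", "Stanford"),
]
for name, start, end in sorted(people, key=lambda p: collating_key(*p)):
    print(name)
```

The stored full name is never altered. Only the derived key changes, so display and collation stay independent.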
