Shipping and billing address normalization

I am currently normalizing the shipping and billing addresses in our software. Basically, when orders are placed, our system checks whether the billing and shipping addresses match; if they don't, the order is automatically added to the fraud queue. The system works fine, but a problem arises when users enter their address differently on shipping and billing... for example:
Shipping Address = "1209 9th Avenue Circle"
Billing Address = "1209 9th Ave. Circle"
So I used a regular expression to replace whole-word occurrences of "Avenue" with "Ave" in the addresses, and it works in all of my test cases. I have done the same for the following words (a simplified sketch of the replacement logic is shown after the list):
suffixes = {
    'avenue': 'ave',
    'street': 'st',
    'boulevard': 'blvd',
    'parkway': 'pkwy',
    'highway': 'hwy',
    'drive': 'dr',
    'place': 'pl',
    'expressway': 'expy',
    'heights': 'hts',
    'junction': 'jct',
    'center': 'ctr',
    'circle': 'cir',
    'cove': 'cv',
    'lane': 'ln',
    'road': 'rd',
    'court': 'ct',
    'square': 'sq',
    'loop': 'lp',
}
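Roughly, the replacement logic looks like this (a simplified sketch of what I'm doing, using the suffixes mapping above; I lower-case and strip periods first so that "Ave." and "Avenue" both end up as "ave"):
import re

def normalize_address(address):
    # Lower-case and drop periods/commas so "Ave." and "avenue" can match.
    address = re.sub(r'[.,]', '', address.lower())
    # \b word boundaries ensure only whole words are replaced.
    for word, abbrev in suffixes.items():
        address = re.sub(r'\b' + word + r'\b', abbrev, address)
    return address

normalize_address("1209 9th Avenue Circle")  # '1209 9th ave cir'
normalize_address("1209 9th Ave. Circle")    # '1209 9th ave cir'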
So I was wondering if I am missing any key words that are used in addresses, or if someone could direct me to a list of the abbreviations used in US addresses.
Thanks in advance.

http://pe.usps.gov/text/pub28/28apc_002.htm has a pretty extensive list, and it is on the official US Postal Service website.

Simple NER - IndexError: string index out of range

Here is a simple example of Named Entity Recognition (NER) using the Natural Language Toolkit (nltk) library in Python:
import nltk

# Input text
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Perform named entity recognition
entities = nltk.ne_chunk(tokens)

# Print the named entities
print(entities)
When I run this code in my Jupyter Notebook, I get this error.
"IndexError: string index out of range"
Am I missing any installation? Please advise.
Expected output:
(PERSON Barack/NNP Obama/NNP)
(GPE Hawaii/NNP)
(ORGANIZATION United/NNP States/NNPS)
nltk.ne_chunk expects its input to be tagged tokens rather than plain tokens, so I would recommend adding a tagging step between the tokenization and NE chunking via nltk.pos_tag. NE chunking still gives you every token, grouped into entity subtrees where entities are detected. Since you want only the entities, you can keep just the chunks that are trees, like the following:
text = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
tokens = nltk.word_tokenize(text)
tagged_tokens = nltk.pos_tag(tokens)
entities = [chunk for chunk in nltk.ne_chunk(tagged_tokens) if isinstance(chunk, nltk.Tree)]
for entity in entities:
print(entity)
Please note that this code doesn't give exactly the output you want. Instead it gives:
(PERSON Barack/NNP)
(PERSON Obama/NNP)
(GPE Hawaii/NNP)
(GPE United/NNP States/NNPS)
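If you do want adjacent tokens of the same entity type grouped together (so that "Barack Obama" comes out as a single PERSON), one possible post-processing step is to merge directly adjacent tree chunks that share a label. This is a sketch of one way to do it, not a built-in nltk feature (it reuses tagged_tokens from above):
merged = []
for chunk in nltk.ne_chunk(tagged_tokens):
    # Fold this chunk into the previous one only when both are entity trees
    # with the same label and no plain token separates them in the stream.
    if (isinstance(chunk, nltk.Tree) and merged
            and isinstance(merged[-1], nltk.Tree)
            and merged[-1].label() == chunk.label()):
        merged[-1].extend(chunk)
    else:
        merged.append(chunk)

for entity in merged:
    if isinstance(entity, nltk.Tree):
        print(entity)  # (PERSON Barack/NNP Obama/NNP), (GPE Hawaii/NNP), ...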

Call Directory Extension: CallKit doesn't recognize numbers with more than 9 digits

I'm working with CallKit and developing an app with a Call Directory Extension. I've followed this tutorial and I'm currently testing the ability to identify numbers that the user doesn't have in their contacts and to show an ID from my app. Although it works perfectly with numbers of 1 to 9 digits, for example 123456, when I set numbers with 10 or more digits, iOS doesn't recognize the number. After a day and a half of googling, I have found no information about this. If anyone can help me, I'll appreciate it. Thanks in advance.
The method that sets the phone numbers to recognize:
private func addAllIdentificationPhoneNumbers(to context: CXCallDirectoryExtensionContext) {
    // Retrieve phone numbers to identify and their identification labels from the data store. For optimal performance and memory usage when there are many phone numbers,
    // consider only loading a subset of numbers at a given time and using autorelease pool(s) to release objects allocated during each batch of numbers which are loaded.
    //
    // Numbers must be provided in numerically ascending order.
    let allPhoneNumbers: [CXCallDirectoryPhoneNumber] = [ 123456789, 1_888_555_5555 ]
    let labels = [ "ID test", "Local business" ]
    for (phoneNumber, label) in zip(allPhoneNumbers, labels) {
        context.addIdentificationEntry(withNextSequentialPhoneNumber: phoneNumber, label: label)
    }
}
With this code, when I simulate a call from the number 123456789, iOS shows the tag "ID test", which is correct. But if I add any digit, for example a 0 at the end (1234567890), iOS doesn't show anything when I simulate a call. I don't know if I'm missing something.
Well, after a bunch of tests I was able to make it work. The point is that the phone number must contain the full country code and the area code. So, for example, 00_52_55_4567_8932 or +52_55_4567_8932 will both work, but 55_4567_8932 and 4567_8932 will not. I hope this can help someone else in the future. Thank you all!
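Applied to the method from the question, the entries would become something like this (a sketch; 52 is the Mexican country code, used here purely as an illustration, and the list must stay in ascending numeric order):
private func addAllIdentificationPhoneNumbers(to context: CXCallDirectoryExtensionContext) {
    // Every entry carries the full country code and area code,
    // and the list is sorted in ascending numeric order.
    let allPhoneNumbers: [CXCallDirectoryPhoneNumber] = [ 1_888_555_5555, 52_55_4567_8932 ]
    let labels = [ "Local business", "ID test" ]
    for (phoneNumber, label) in zip(allPhoneNumbers, labels) {
        context.addIdentificationEntry(withNextSequentialPhoneNumber: phoneNumber, label: label)
    }
}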

How to parse files with arbitrary lengths?

I have a text file with records like this that I'd like to parse:
===================
name: John Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Sun Java Certified Programmer
Age: 29
===================
name: Bob Bear
Education: High School Diploma
Age: 18
===================
name: Jane Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Master's Degree
Education: AWS Certified Solution Architect Professional
Age: 25
As you can see, the fields in such a text file are fixed, but some of them repeat an arbitrary number of times. The records are separated by a fixed-length === delimiter.
How would I write parsing logic for this sort of problem? I am thinking of using a switch on the start of each line, but the logic for handling the repeating fields baffles me.
A good way to approach this sort of problem is to "divide and conquer". That is, divide the overall problem into smaller sub-problems which are easier to manage, and then solve each of them individually. If you've planned properly, then when you've finished each of the sub-problems you should have solved the whole problem.
Start by thinking about modeling. The document appears to contain a list of records; what should those records be called? What named fields should the records contain, and what types should they have? How would you represent them idiomatically in Go? For example, you might decide to call each record a Person with fields as such:
type Person struct {
    Name        string
    Credentials []string
    Age         int
}
Next, think about what the interface (signature) of your parse function should look like. Should it emit an array of people? Should it use a visitor pattern and emit a person as soon as it's parsed? What constraints should drive the answer? Are memory or compute time constraints important? Does the user of the parser want any control over the parsing work such as canceling? Do they need metadata such as the total number of records contained in the document? Will the input always be from a file or a string, maybe from an HTTP request or a network socket? How will these choices drive your design?
func ParsePeople(string) ([]Person, error)           // ?
func ParsePeople(io.Reader) ([]Person, error)        // ?
func ParsePeople(io.Reader, func(Person) bool) error // ?
Finally you can implement your parser to fulfill the interface that you've decided on. A straightforward approach here would be to read the input file line-by-line and take an action according to the contents of the line. For example (in pseudocode):
forEach line in inputFile.lines
    if line is a separator
        emit or store the last parsed person, if present
        create a new person to store parsed fields
    else if line is a data field
        parse the data
        update the person with the parsed data
    end
end
return the parsed records or final record, if emitting
Each line of pseudocode above represents a sub-problem that should be easier to solve than the whole.
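For instance, here is a minimal sketch of that loop against the io.Reader signature, assuming the Person struct above and the exact field labels from the sample input; error handling is deliberately thin:
import (
	"bufio"
	"io"
	"strconv"
	"strings"
)

func ParsePeople(r io.Reader) ([]Person, error) {
	var people []Person
	var current Person
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, "==="): // separator: store the last person
			if current.Name != "" {
				people = append(people, current)
			}
			current = Person{}
		case strings.HasPrefix(line, "name: "):
			current.Name = strings.TrimPrefix(line, "name: ")
		case strings.HasPrefix(line, "Education: "):
			current.Credentials = append(current.Credentials, strings.TrimPrefix(line, "Education: "))
		case strings.HasPrefix(line, "Age: "):
			current.Age, _ = strconv.Atoi(strings.TrimPrefix(line, "Age: "))
		}
	}
	// Keep the final record if the input doesn't end with a separator.
	if current.Name != "" {
		people = append(people, current)
	}
	return people, scanner.Err()
}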
Edit: added an explanation of why I just post a program as an answer.
I am presenting a very straightforward implementation to parse the text you have given in your question. You accepted maerics' answer and that is OK. I want to add some counterarguments to that answer, though. Basically, the pseudocode in that answer is a non-compilable version of the code in my answer, so we agree on the solution.
What I do not agree with is the over-engineering talk. I have to deal with code written by over-thinkers every day. I urge you NOT to think about patterns, memory and time constraints, or who might want what from this in the future.
Visitor pattern? That is something that is pretty much only useful in parsing programming languages, do not try to construct a use-case for it out of this problem. The visitor pattern is for traversing trees with different types of things in it. Here we have a list, not a tree, of things that are all the same.
Memory and time constraints? Are you parsing 5 GB of text with this? Then this might be a real concern. But even if you are, always write the simplest thing first; it will suffice. Throughout my career I have only needed to use something other than a simple array, or to apply a complicated algorithm, at most once per year. Still, I see code everywhere that uses complicated data structures and algorithms without reason. This complicates change, is error-prone, and sometimes even makes things slower! Do not use an observable list abstraction that notifies all observers whenever its contents change - but wait, let's add an update lock and unlock so we can control when NOT to notify everybody... No! Do not go down that route. Use a slice. Do your logic. Make everything read easily from top to bottom. I do not want to jump from A to B to C, chasing interfaces, following getters, to finally find not a concrete data type but yet another interface. That is not the way to go.
These are the reasons why my code does not export anything; it is a self-contained, runnable example, a concrete solution to your concrete problem. You can read it, and it is easy to follow. It is not heavily commented because it does not need to be: the three comments do not state what happens but why it happens, and everything else is evident from the code itself. I left the note about the potential error in there on purpose. You know what kind of data you have, and there is no line in there where this bug would be triggered. Do not write code to handle what cannot happen. If in the future someone adds a line without text after the colon (remember, nobody will ever do this, do not worry about it), this will trigger a panic that points you to this line; you add another if or something, and you are done. This code is more future-proof than a program that tries to handle all kinds of non-existent variations of the input.
The main point that I want to stress is: write only what is necessary to solve the problem at hand. Everything beyond that makes your program hard to read and change, and it will be untested and unnecessary.
With that said, here is my original answer:
https://play.golang.org/p/T6c51jSM5nr
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	type item struct {
		name       string
		educations []string
		age        int
	}

	var items []item
	var current item
	finishItem := func() {
		if current.name != "" { // handle the first ever separator
			items = append(items, current)
		}
		current = item{}
	}

	lines := strings.Split(code, "\n")
	for _, line := range lines {
		if line == separator {
			finishItem()
		} else {
			colon := strings.Index(line, ":")
			if colon != -1 {
				id := line[:colon]
				value := line[colon+2:] // note potential bug if text has nothing after ':'
				switch id {
				case "name":
					current.name = value
				case "Education":
					current.educations = append(current.educations, value)
				case "Age":
					age, err := strconv.Atoi(value)
					if err == nil {
						current.age = age
					}
				}
			}
		}
	}
	finishItem() // in case there was no separator at the end

	for _, item := range items {
		fmt.Printf("%s, %d years old, has educations:\n", item.name, item.age)
		for _, e := range item.educations {
			fmt.Printf("\t%s\n", e)
		}
	}
}
const separator = "==================="
const code = `===================
name: John Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Sun Java Certified Programmer
Age: 29
===================
name: Bob Bear
Education: High School Diploma
Age: 18
===================
name: Jane Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Master's Degree
Education: AWS Certified Solution Architect Professional
Age: 25`

R - how to exclude penny stocks from the environment before calculating adjusted stock returns

In my current research I'm trying to find out how big the impact of ad-hoc sentiment on daily stock returns is.
The calculations work quite well and the results are plausible.
The calculations so far, using the quantmod package and Yahoo financial data, look like this:
getSymbols(c("^CDAXX",Symbols) , env = myenviron, src = "yahoo",
from = as.Date("2007-01-02"), to = as.Date("2016-12-30")
Returns <- eapply(myenviron, function(s) ROC(Ad(s), type="discrete"))
ReturnsDF <- as.data.table(do.call(merge.xts, Returns))
# adjust column names
colnames(ReturnsDF) <- gsub(".Adjusted","",colnames(ReturnsDF))
ReturnsDF <- as.data.table(ReturnsDF)
However, to make this more robust against the noisy influence of penny-stock data, I wonder how it is possible to exclude stocks that at any point in the time period fall below a certain value x, let's say €1.
I guess the best approach would be to exclude them before calculating the returns and merging the xts results, or even better, before downloading them with the getSymbols command.
Has anybody an idea how this could work best? Thanks in advance.
Try this:
1. Build a price frame of the adjusted closing prices of your symbols. (I use the PF function of the quantmod add-on package qmao, which has lots of other useful functions for this type of analysis: install.packages("qmao", repos="http://R-Forge.R-project.org").)
2. Check, by column, whether any price is below your minimum trigger price.
3. Select only the columns which have no closings below the trigger price.
To stay more flexible I would suggest taking a sub-period, let's say no price below 5 during the last 21 trading days. The toy example below may illustrate my point.
I use AAPL, FB and MSFT as the symbol universe.
> symbols <- c('AAPL','MSFT','FB')
> getSymbols(symbols, from='2018-02-01')
[1] "AAPL" "MSFT" "FB"
> prices <- PF(symbols, silent = TRUE)
> prices
AAPL MSFT FB
2018-02-01 167.0987 93.81929 193.09
2018-02-02 159.8483 91.35088 190.28
2018-02-05 155.8546 87.58855 181.26
2018-02-06 162.3680 90.90299 185.31
2018-02-07 158.8922 89.19102 180.18
2018-02-08 154.5200 84.61253 171.58
2018-02-09 156.4100 87.76771 176.11
2018-02-12 162.7100 88.71327 176.41
2018-02-13 164.3400 89.41000 173.15
2018-02-14 167.3700 90.81000 179.52
2018-02-15 172.9900 92.66000 179.96
2018-02-16 172.4300 92.00000 177.36
2018-02-20 171.8500 92.72000 176.01
2018-02-21 171.0700 91.49000 177.91
2018-02-22 172.5000 91.73000 178.99
2018-02-23 175.5000 94.06000 183.29
2018-02-26 178.9700 95.42000 184.93
2018-02-27 178.3900 94.20000 181.46
2018-02-28 178.1200 93.77000 178.32
2018-03-01 175.0000 92.85000 175.94
2018-03-02 176.2100 93.05000 176.62
Let's assume you would like any instrument which traded below 175.40 during the last 6 trading days to be excluded from your analysis :-).
As you can see, that shall exclude AAPL and MSFT.
apply and the base function any, applied(!) to a 6-day subset of prices, give us exactly what we want. Showing the last 3 days of prices, excluding the instruments that traded below the trigger:
> tail(prices[,apply(tail(prices),2, function(x) any(x < 175.4)) == FALSE],3)
FB
2018-02-28 178.32
2018-03-01 175.94
2018-03-02 176.62
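Transferring this to your own setup, a sketch (assuming your symbols are already loaded into myenviron as in your question, quantmod is attached, and trigger is the €1 cutoff you mentioned) could filter the full period before computing the returns:
trigger <- 1

# Merge the adjusted closes of everything in the environment into one object,
# then keep only the columns that never fall below the trigger.
prices <- do.call(merge.xts, eapply(myenviron, Ad))
keep   <- apply(prices, 2, function(x) !any(x < trigger, na.rm = TRUE))
cleanPrices <- prices[, keep]

# Returns for the surviving stocks only.
Returns <- ROC(cleanPrices, type = "discrete")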

Parse Phone Number into component parts

I need a well-tested regular expression (.NET style preferred), or some other simple bit of code, that will parse a USA/CA phone number into component parts, so:
3035551234122
1-303-555-1234x122
(303)555-1234-122
1 (303) 555 -1234-122
etc...
all parse into:
AreaCode: 303
Exchange: 555
Suffix: 1234
Extension: 122
None of the answers given so far was robust enough for me, so I continued looking for something better, and I found it:
Google's library for dealing with phone numbers
I hope it is also useful for you.
This is the one I use:
^(?:(?:[\+]?(?<CountryCode>[\d]{1,3}(?:[ ]+|[\-.])))?[(]?(?<AreaCode>[\d]{3})[\-/)]?(?:[ ]+)?)?(?<Number>[a-zA-Z2-9][a-zA-Z0-9 \-.]{6,})(?:(?:[ ]+|[xX]|(?i:ext[\.]?)){1,2}(?<Ext>[\d]{1,5}))?$
I got it from RegexLib I believe.
This regex works exactly as you want with your examples:
Regex regexObj = new Regex(@"\(?(?<AreaCode>[0-9]{3})\)?[-. ]?(?<Exchange>[0-9]{3})[-. ]*?(?<Suffix>[0-9]{4})[-. x]?(?<Extension>[0-9]{3})");
Match matchResult = regexObj.Match("1 (303) 555 -1234-122");

// Now you have the results in groups
string areaCode  = matchResult.Groups["AreaCode"].Value;
string exchange  = matchResult.Groups["Exchange"].Value;
string suffix    = matchResult.Groups["Suffix"].Value;
string extension = matchResult.Groups["Extension"].Value;
Strip out anything that's not a digit first. Then all your examples reduce to:
/^1?(\d{3})(\d{3})(\d{4})(\d*)$/
To support all country codes is a little more complicated, but the same general rule applies.
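For instance, a sketch of that strip-first approach in C# (ParseUsPhone is a hypothetical helper name, not part of any library):
using System.Text.RegularExpressions;

static (string Area, string Exchange, string Suffix, string Ext)? ParseUsPhone(string raw)
{
    // Strip everything that isn't a digit, then split the parts positionally.
    string digits = Regex.Replace(raw, @"\D", "");
    Match m = Regex.Match(digits, @"^1?(\d{3})(\d{3})(\d{4})(\d*)$");
    if (!m.Success) return null;
    return (m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value, m.Groups[4].Value);
}

// ParseUsPhone("1 (303) 555 -1234-122") -> ("303", "555", "1234", "122")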
Here is a well-written library used with GeoIP for instance:
http://highway.to/geoip/numberparser.inc
Here's a method that is easier on the eyes, provided by the Z Directory (vettrasoft.com) and geared towards American phone numbers:
string_o s2, s1 = "888/872.7676";
z_fix_phone_number (s1, s2);
cout << s2.print(); // prints "+1 (888) 872-7676"
phone_number_o pho = s2;
pho.store_save();
The last line stores the number to the database table "phone_number", with column values country_code = "1", area_code = "888", exchange = "872", etc.
