Normalizing addresses in the UK

I have been going through different sites like Crunchy Data Solutions and this Python project, but all of them seem to be focused on the US. I would like to know if someone knows the best way to normalize addresses for matching records coming from different sources. I have searched older questions but couldn't find any question or answer focused on the UK.

At the moment the closest I have got is, for a single-line address, to concatenate all the address-related fields and then run:
UPDATE table
SET concatenate_fields = replace(concatenate_fields, ',', '');
This produces a single line with the commas removed. Then:
UPDATE table
SET concatenate_fields = UPPER(concatenate_fields);
This produces a single uppercase line with no commas. I have done the same for all the tables I need to match on address, but I still end up with mismatches.
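For illustration, here is a minimal Python sketch of a more aggressive normalization pass along the same lines; the abbreviation map and the postcode pattern are my own assumptions to adapt to your data, not an official standard:

import re

# Assumed abbreviation map - extend for your own data. Note that blanket
# expansion can misfire (e.g. "ST ALBANS"), so treat this as a starting point.
ABBREVIATIONS = {
    "RD": "ROAD", "ST": "STREET", "AVE": "AVENUE",
    "LN": "LANE", "DR": "DRIVE", "CT": "COURT",
}

# Rough UK postcode shape (outward + inward code), e.g. "SW1A 1AA".
POSTCODE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b")

def normalize_uk_address(raw: str) -> str:
    s = raw.upper()
    s = re.sub(r"[^\w\s]", " ", s)      # drop commas and other punctuation
    s = re.sub(r"\s+", " ", s).strip()  # collapse runs of whitespace
    s = " ".join(ABBREVIATIONS.get(w, w) for w in s.split())
    return POSTCODE.sub(r"\1 \2", s)    # force one space inside the postcode

print(normalize_uk_address("12, Baker st., London, NW16XE"))
# -> 12 BAKER STREET LONDON NW1 6XE

Normalizing both sides like this before comparing usually removes far more mismatches than case and comma handling alone.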

Related

How to OCR scanned voting protocols

As part of a hobby project I'm trying to digitise all the voting records of the Swedish parliament to see if I can extract any interesting statistics (yes, a strange hobby, I know).
From 1983 to 2001 the voting records look something like the example. They are printed from some kind of voting machine and only exist on paper (now scanned and on my disk).
Every vote consists of three pages with two columns each of members and votes as in the example.
Some translations from Swedish: (Plats = Seat, Ledamöter = Members, Parti = party, Röst = Vote).
The list is sorted on the party column and then alphabetically on the member column.
After an election a member stays in their seat until the next election, or until the member quits parliament and is replaced by a replacement member. There can also be temporary replacements.
Replacements are also sorted into the list.
The Parti/Party column contains the party abbreviation and can only be one of nine letters (since there are only nine different parties in that time period). Members stay with their party but can technically change between or during election periods.
The Röst/Vote column can be one of J, N, A, F (Ja = Yea, Nej = Nay, Avstår = Pass, Frånvarande = Absent), and the votes are also aligned in four sub-columns.
The columns and rows are not always in the same place in the picture; they can be slightly translated and/or rotated. The quality of the scan is also not always this good.
At the top of the first page (not shown in the example) there is a summation of votes per party that can be checked against.
There are about 18,000 votes in total.
For votes before 1983 the layout was different, and I was able to make a custom program in node.js (although most languages would be OK for me) that could semi-automatically scan the votes, but this looks like something that should be easy to do with Tesseract or something similar.
My question is really whether it's possible to hint Tesseract about the layout so that it can make better guesses about what the text is. I'm aware one can make a custom wordlist (where I could, for example, add all the members' names manually).
I'm guessing that there might be a way to make a custom pattern list but I haven't figured that out.
Does anyone have any good suggestions on how to tackle this?
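As a sketch of what layout hinting can look like with pytesseract: the flags below are real Tesseract options, but the file names and crop boxes are made up and the recipe is untested. In particular, the word/pattern hints historically apply only to the legacy engine, hence --oem 0:

import pytesseract
from PIL import Image

# Deskew and crop each column first (e.g. with OpenCV), since rows and
# columns shift slightly between scans.
img = Image.open("vote_page_1.png")  # made-up file name

# --psm 4: treat the crop as a single column of variable-size text.
# --user-words suits the member-name column; a character whitelist suits
# a crop containing only the four possible vote letters.
name_cfg = "--oem 0 --psm 4 --user-words members.txt"
vote_cfg = "--oem 0 --psm 4 -c tessedit_char_whitelist=JNAF"

# Crop boxes are placeholders; measure them from your own scans.
names = pytesseract.image_to_string(img.crop((0, 0, 400, 2000)),
                                    lang="swe", config=name_cfg)
votes = pytesseract.image_to_string(img.crop((400, 0, 600, 2000)),
                                    lang="swe", config=vote_cfg)
print(names, votes)

A --user-patterns file can constrain token shapes in a similar way, and the per-party vote summation on the first page gives you a built-in sanity check for each processed page.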

Transform comma-separated Google Form answers to multiple lines in spreadsheet

I have made a Google Form for which some answers are stored as comma-separated strings in the automatically populated Google spreadsheet. I would like to read from this sheet into another sheet and reformat the answers so that each comma-separated answer is shown on a new row. I have tried to apply an ARRAYFORMULA that reads from the original sheet combined with a solution that uses SPLIT and TRANSPOSE on the cell content; however, this fails in combination with the ARRAYFORMULA, since it would overwrite the contents of other cells.
Here is an example spreadsheet with the responses, a solution sheet, and a desired results sheet. https://docs.google.com/spreadsheets/d/1r_l5fVJ9lGfpubO2o3pXicV7JlZWmANjwSgNi7_DL0A
Any suggestions for how I can achieve the end result?
Okay, I assume this isn't really what you want, but visually it looks okay...
Try this formula:
={{'Form responses'!A2:A3},ArrayFormula(regexreplace('Form responses'!B2:E3,", ",CHAR(10)))}
Then format the cells so that the cell contents are TOP-aligned, instead of the default BOTTOM-aligned.
Realistically, I imagine that you want each question answer split into multiple cells. But if your data responses really contain letter values separated by commas, as you've indicated, you can still search through those cells to find whether an answer contains a certain value. It all depends on why you want the results structured the way you do.
If you can clarify what you want to do with the form results, instead of just appearing vertically for each question, perhaps we can provide a full solution for that requirement?
UPDATE1:
Okay, I may be getting close. I can get your data transformed to look like the following:
This would let you do the analysis that you want, by searching for Q.1 (question 1 responses) in the first column, and then all the answers in the third column, along with the owner in column 2. And from this, it will also definitely be possible to put the results in the exact form you want. It just may take an intermediate step.
UPDATE2:
Okay, I think I have something you can use. I can convert your data to either of the following two layouts.
The one on the right is closest to what you asked for, with the exception that the answers on the right are bottom aligned, with blanks above. But you can still process them for analysis, with queries. I honestly think having the user identifier (email address) on each row would make things simpler, but I can provide it either way.
The layout on the left is more of a traditional database layout, and would make analysis very simple. Each row has the date and email identifiers, the question number, and the answer (or one of the answers) to that question, from that user.
If this is helpful, it might be best if you enabled editing on your sample sheet, so I can implement it there. But here is my sample sheet, in case anyone wants to look through it. Note that the main formula to reformat the data, in Solution!B3, could benefit from a lot of cleanup and is probably nowhere near the best way to achieve this. Just throwing out one possible solution...
I'll try to add some explanation for the formula at some point, but ask if you have any questions.
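For anyone who would rather do this reshaping outside the spreadsheet, here is a minimal Python sketch of the same split-to-rows transformation, assuming the form responses are exported as CSV with a timestamp and email column followed by one column per question (the file and column names are illustrative):

import csv

# Assumed export: Form responses -> responses.csv
# Columns: Timestamp, Email, Q1, Q2, ... (names are illustrative).
with open("responses.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# One output row per individual answer: the "database layout" described
# in UPDATE2 above.
flat = []
for row in rows:
    for question in (k for k in row if k.startswith("Q")):
        for answer in row[question].split(","):
            if answer.strip():
                flat.append((row["Timestamp"], row["Email"],
                             question, answer.strip()))

with open("flat.csv", "w", newline="") as f:
    csv.writer(f).writerows(flat)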

Check String Similarity Between Columns In Google Sheet

I have a table in the link below:
https://docs.google.com/spreadsheets/d/1EOALaBVzHijUP_8dM1Sr7KTutdTah8b9Q0xDRoNHBLo/edit#gid=0
If the text is split first and then checked, what do you do? For example, "Kebumen District Office" vs "District Head Office of Kebumen District": then we need 7x7 = 49 columns, because we will match every word against every word (1-1, 1-2, 1-3, 1-4, 2-1, 2-2, 2-3, 2-4, etc.).
The text in column B is split, and then each word is checked against the text in column A. If many words in column B are not found, the texts are not similar.
I am still confused about how to write the formula. Please give me a solution. Thanks.
The matching patterns are very different in your case and I see no solution based on formulas (regular expressions).
You may need to find articles about fuzzy vlookup.
Here's what I found for google sheets (not tested):
Addon, find fuzzy matches
This problem is common in Excel; there are solutions based on VBA.
As I said, one formula won't solve your task because you have many cases. The first example, Mc Donald vs McDonald, is easily checked with a formula:
=SUBSTITUTE(A, " ", "") = SUBSTITUTE(B, " ", "")
Your next samples are different. You may use some code, but even that won't give the expected results. My suggestion: split the task into small cases and try to solve them separately. Investigate, or ask a new question for each case.
Your second and third lines are case 2. In this case, you need to check that all the words in A are also in B. You'll need to try solving it and ask another question if needed. And so on.
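As an illustration of what such per-case code could look like (my own sketch, using only Python's standard library): the word-subset check for case 2, plus a generic similarity ratio that behaves like a fuzzy-match threshold:

from difflib import SequenceMatcher

def words_subset(a: str, b: str) -> bool:
    # Case 2: every word of A also appears in B (order ignored).
    return set(a.upper().split()) <= set(b.upper().split())

def similarity(a: str, b: str) -> float:
    # Generic 0..1 similarity ratio, comparable to a fuzzy threshold.
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

print(words_subset("Kebumen District Office",
                   "District Head Office of Kebumen District"))  # True
print(round(similarity("Mc Donald", "McDonald"), 2))             # 0.94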
Fuzzy matching is definitely the way to go. Different algorithms have different strengths and weaknesses. My suggestion is that you visit the G Suite marketplace and look for Flookup or simply follow this link:
Flookup for Google Sheets
It'll allow you to look for matches ranging from 0% to 100% similarity. The basic formula is:
FLOOKUP(lookupValue, tableArray, lookupCol, indexNum, [threshold], [rank])
Find out more from the official website.
Edit: I'm the creator of Flookup.

SPSS Frequency Plot Complication

I am having a hard time generating precisely the frequency table I am looking for using SPSS.
The data in question: cases (n = ~800) with categorical variables DX_n (n = 1-15), each containing ICD9 codes, many of which are the same code. I would like to create a frequency table that groups the DX_n variables such that I can view frequency of every diagnosis in this sample of cases.
The next step is to test the hypothesis that the clustering of diagnoses in this sample is different than that of another. If you have any advice as to how to test this, that would be really appreciated as well!
Thanks!
Edit: My attempts:
1) Analyze -> Descriptive Statistics -> Frequencies; then add variables DX_n (1-15) and display frequency charts. The output is frequencies of each ICD9 code per DX_n variable (so 15 tables are generated - I'm hoping to just have one grouped table).
2) I tried adjusting the output format to organize by variable and also to compare variables but neither option gives the output I'm looking for.
I think what you are looking for is CTABLES. It can do parallel columns of frequencies, and it includes a column-proportions test that can show whether the distributions differ.
Thank you, JKP! You set me on exactly the right track. I'm not sure how I overlooked that menu. Just to clarify in case anyone else comes along needing to figure this out:
Group diagnosis variables into a multiple response set using Analyze > Custom Tables > Multiple Response Sets. Code the variables as categories.
http://i.imgur.com/ipE9suf.png
Create a custom table with your new multiple response set as a row and the subsets to compare as columns. I set summary statistics to compute from rows and added the column n% column (sorted descending).
http://i.imgur.com/hptIkfh.png
Under test statistics, include a column proportions z-test as JKP suggested.
http://i.imgur.com/LYI6ZRl.png
Behold, your results:
http://i.imgur.com/LgkBA8X.png
Thanks again, and best of luck to anyone else who runs across this.
-GCH
p.s. Sorry everyone, I was going to post the images inline but don't have enough reputation points yet. Images detailing the steps in the GUI can be found at the links above.
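For anyone working outside SPSS, the same pooled frequency table and a rough first-pass test of the clustering hypothesis can be sketched with pandas and scipy; the file layout and column names below are assumptions for illustration:

import pandas as pd
from scipy.stats import chi2_contingency

# Assumed layout: one row per case, columns DX_1..DX_15 holding ICD9 codes
# (blank where unused), plus a 'sample' column identifying the cohort.
df = pd.read_csv("cases.csv")
dx_cols = [f"DX_{i}" for i in range(1, 16)]

# Pool all 15 diagnosis columns into one long series: the single grouped
# frequency table the question asks for.
long = df.melt(id_vars="sample", value_vars=dx_cols,
               value_name="icd9").dropna(subset=["icd9"])
print(long["icd9"].value_counts())

# Diagnosis-by-sample contingency table with an overall chi-square test.
# Caveat: diagnoses within a case are not independent, so treat the p-value
# as a rough screen rather than a definitive test.
table = pd.crosstab(long["icd9"], long["sample"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.4f}")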

User input parsing - city / state / zipcode / country

I'm looking for advice on parsing input from a user in multiple combinations of City / State / Zip Code / Country.
A common example would be what Google maps does.
Some examples of input would be:
"City, State, Country"
"City, Country"
"City, Zip Code, Country"
"City, State, Zip Code"
"Zip Code"
What would be an efficient and correct way to parse this input from a user?
If you are aware of any example implementations please share :)
The first step would be to break the text up into individual tokens, using spaces or commas as the delimiting characters. For scalability, you can then hand each token to a thread or server (if using a Map-Reduce-like architecture) to figure out what each token is. For instance:
If we have numbers in the pattern, then it's probably a zip code.
Is the item in the list of known states?
Countries are also fairly easy to handle like states, there's a limited number.
What order are the tokens in compared to the common ways of writing an address? Most input will probably follow the local post office custom for address formats.
Once you have the individual token results, you can glue the parts back together to get a full address. Where there are ambiguities, you can ask the user what they really meant (like Google Maps does) and add that information to a learned list.
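A minimal sketch of that classify-then-reassemble idea in Python; the lookup tables are tiny stand-ins for real state/country/city gazetteers (assumptions for illustration):

import re

# Toy lookup tables - in practice these come from a real gazetteer.
STATES = {"CA": "California", "NY": "New York", "WA": "Washington"}
COUNTRIES = {"US", "USA", "UNITED STATES"}
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # US ZIP / ZIP+4

def classify(token: str) -> str:
    t = token.strip().upper()
    if ZIP_RE.match(t):
        return "zip"
    if t in STATES or t.title() in STATES.values():
        return "state"
    if t in COUNTRIES:
        return "country"
    return "city"  # whatever is left over is probably the city

def parse(text: str) -> dict:
    parsed = {}
    for tok in text.split(","):
        if tok.strip():
            parsed.setdefault(classify(tok), tok.strip())
    return parsed

print(parse("Seattle, WA, 98101"))
# -> {'city': 'Seattle', 'state': 'WA', 'zip': '98101'}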
The easiest way to add that support to an application, assuming you're not trying to build a map system, is to query Google or Yahoo and ask them to parse the data for you.
I am myself very fascinated with how Google handles that. I do not remember seeing anything similar anywhere else.
I believe you would try to separate the input string into words using various delimiters: space, comma, semicolon, etc. Then you have several combinations. For each combination, you take each word and match it against country, city, town, and postal-code databases. Then you define some metric for evaluating the group match result for each combination. There should also be cross rules: for example, if the postal code does not match well, but country, city, and town match well and in combination refer to a valid address, then the metric yields a high mark.
It is certainly difficult and not an evening code exercise. It also requires substantial computational resources: a shared host would probably crack under just 10 requests, but a data center could serve it well.
Not sure if there is an example implementation. Many geographical services are offered on a paid basis. Something as sophisticated as Google Maps would likely cost a fortune.
Correct me if I'm wrong.
I found a simple PHP implementation
http://www.eotz.com/2008/07/parsing-location-string-php/
Yahoo seems to have a webservice that offers the functionality (sort of)
http://developer.yahoo.com/geo/placemaker/
Openstreetmap seems to offer the same search functionality on its homepage
http://www.openstreetmap.org/
Assuming you're only dealing with those four fields (City, Zip, State, Country), there are finitely many values for every field except City, and even that, if you have a big city list, is effectively finite. So just split the input by comma, then check each piece against each field list.
Assuming we're talking US addresses:
Zip is most obvious, so check for that first.
State has 50x2 options (California or CA); check that next.
Country has ~190x2 options, depending on how encompassing you want to be (US, United States, USA).
Whatever is left over is probably your City.
As far as efficiency goes, it might make sense to check a handful of 'standard' formats first, like Dan suggests.
