Cleaning data in SPSS with name misspellings - spss

I have a 5M records dataset in this basic format:
FName LName UniqueID DOB
John Smith 987678 10/08/1976
John Smith 987678 10/08/1976
Mary Martin 567834 2/08/1980
John Smit 987678 10/08/1976
Mary Martin 768987 2/08/1980
The DOB is always unique, but I have cases with:
the same ID but different name spellings, or different IDs but the same name.
I got as far as making SPSS recognize that John Smit and John Smith with the same DOB are the same people, and I used aggregate to show how many times a spelling was used near the name (John Smith, 10; John Smit 5).
Case 1:
What I would like to do is to loop through all the records for the people identified to be the same person, and get the most common spelling of the person's name and use that as their standard name.
Case 2:
If I have multiple IDs for the same person, take the lowest one and make that the standard.
I am comfortable using basic syntax to clean my data, but this is the only thing that I'm stuck on.

If UniqueID really is a unique identifier for individuals in the population, and you want to find variations in name spelling within each ID and assign the modal spelling, then something like this would work:
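* Combine first and last name into one string (RTRIM removes trailing blanks from FName).
STRING FirstLastName (A99).
COMPUTE FirstLastName = CONCAT(RTRIM(FName), " ", LName).
* Count how often each spelling occurs within each ID.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES /BREAK=UniqueID FirstLastName /Count=N.
* Find the highest spelling count within each ID.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES /BREAK=UniqueID /MaxCount=MAX(Count).
* Blank out every spelling that is not the modal one (a string variable cannot take $SYSMIS).
IF (Count<>MaxCount) FirstLastName = "".
* Spread the modal spelling to all records with the same ID.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES OVERWRITE=YES /BREAK=UniqueID /FirstLastName=MAX(FirstLastName).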
You could then also overwrite the FName and LName fields, but that would require more assumptions, for example about whether FName or LName can contain space characters, etc.
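For anyone who wants to check the logic outside SPSS, here is a minimal Python sketch of the same idea, not a replacement for the syntax above. It uses the five sample records from the question; the function names `modal_name_by_id` and `lowest_id_by_person` are made up for illustration. The second function covers Case 2 under the assumption that a person is identified by standardized name plus DOB.

```python
from collections import Counter, defaultdict

# Sample records from the question: (FName, LName, UniqueID, DOB).
records = [
    ("John", "Smith", 987678, "10/08/1976"),
    ("John", "Smith", 987678, "10/08/1976"),
    ("Mary", "Martin", 567834, "2/08/1980"),
    ("John", "Smit", 987678, "10/08/1976"),
    ("Mary", "Martin", 768987, "2/08/1980"),
]


def modal_name_by_id(rows):
    """Case 1: most common full-name spelling for each UniqueID."""
    counts = defaultdict(Counter)
    for fname, lname, uid, _dob in rows:
        counts[uid][f"{fname} {lname}"] += 1
    # Ties go to the alphabetically greatest spelling, mirroring MAX() on strings.
    return {
        uid: max(c.items(), key=lambda item: (item[1], item[0]))[0]
        for uid, c in counts.items()
    }


def lowest_id_by_person(rows, names):
    """Case 2: lowest UniqueID for each (standardized name, DOB) pair (assumed key)."""
    lowest = {}
    for _fname, _lname, uid, dob in rows:
        key = (names[uid], dob)
        lowest[key] = min(uid, lowest.get(key, uid))
    return lowest


standard_names = modal_name_by_id(records)
standard_ids = lowest_id_by_person(records, standard_names)
```

With the sample data, ID 987678 gets the spelling "John Smith", and the two Mary Martin IDs collapse to 567834.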

Related

How to search a string for multiple multi-word phrases in Swift or Objective-C

I want to parse a large number of strings for canned phrases or names and then store the names, if found, in an array where the order counts.
So for example, starting with a string such as:
str = "The movie stars Robert Duvall and James Earl Jones and pits them against a villain played expertly by Brando in an action packed adventure."
I would like to search against an array of actors:
names = [Robert Duvall, Henry Fonda, Brando, Marlon Brando, Jane Fonda, James Earl Jones, Peter Fonda, Montgomery Clift] etc where the actors can have one, two or three names.
Initially, I could simply check for a match on the triples using strpos or convert the string to triples and do a match on triples as in James Earl Jones. Then I could remove his name and search the remainders for other doubles or individual words. However, this approach starts to get very complicated quickly and I'm wondering if there isn't a more elegant approach.
//This road looks very messy indeed...
NSArray *triples = [self getTriples:str];//get all combinations of three sequential words
NSArray *rumps = nil; // pieces of str before and after a matched phrase
NSMutableArray * matches = [NSMutableArray new];
for (long i = 0;i<[triples count];i++) {
NSString *phrase = triples[i];
for (long j = 0;j<[names count];j++) {
NSString *name = names[j];
if ([phrase caseInsensitiveCompare:name]==NSOrderedSame) {
[matches addObject:phrase];
//Rumps has two elements, before and after
rumps = [str componentsSeparatedByString:phrase];
NSString *start = rumps[0];
NSString *end = rumps[1];
//Search before for a name
//search after for a name
}
}
}//end triples
Thanks for any suggestions.
Here is an idea based on your names string.
Split names on comma and store in an array, say a1
Loop through a1 and see if you have any match on full name
If not, loop a1 again and split on spaces into a2
Here I am not that clear on your logic, but maybe like so? Now in this inner loop, where you loop a2
If a2 has three elements / names then you assume no match? Or you can check all possible combinations, not too bad for just 3 (123 already checked, then 132, 213, 231, 312, 321 and you're done with 3 names).
If it has two elements only check in reverse (21, you already checked 12).
If still no match you can check on the individual elements of a2 if that is what you want, so check on 1, 2 (and possibly 3).
Any match you use the corresponding a1 element - which is what you want, the full name, right?
You can use an index set and add the index into a1 of each actor you find, to prevent duplicates.
Here is one possible algorithm sketch, there will be no real code – indeed as I write this it has not been written in Objective-C or Swift, it is an algorithm which can be implemented in both (and other) languages.
In coding it you may find the algorithm missed something (i.e. there could be errors, this was written directly into the answer, it is a sketch!), in which case go back and refine the algorithm and repeat.
Our sample name list:
James Earl Jones, James, Marlon Brando, Earl Jones, Brando, James Earl
and sample text:
James, James Earl and James Earl Jones all regular meet for coffee
The algorithm is based around the observations:
[Note: In the description we assume left-to-right text and that search for a match moves left-to right. The algorithm will work for right-to-left with simple adjustments, for mixed direction text it will get messier!]
Matches cannot overlap. E.g. "James Earl" is not both "James" and "James Earl". We say a match consumes the text.
Only names which are prefixes of another one need care; ones which are "postfixes" do not. E.g. if looking for "James" and "James Earl" you must look for the latter first to avoid getting a match on "James" and then missing "James Earl", as a match on "James" has consumed those characters. However "Earl Jones" and "James Earl Jones" can be searched for at the same time, the latter will match first.
In a collection of names which do not contain any prefixes they can all be matched at the same time using a regular expression with alternates. E.g. "James Earl Jones" and "Earl Jones" can be matched by the RE "James Earl Jones|Earl Jones"
When you have prefixes, so you search for the longer first, a match for a shorter name can only occur to the left of the match for the longer one.
The algorithm uses regular expression matching, as provided by NSRegularExpression in Objective-C & Swift; and ranges, as provided by NSRange, which allow searching in part of a string.
The outline:
Sort your names. E.g.:
Brando, Earl Jones, James, James Earl, James Earl Jones, Marlon Brando
Divide your names into two lists by removing any name which is a prefix of its immediately following name and placing into a second list. E.g.
Brando, Earl Jones, James Earl Jones, Marlon Brando
James, James Earl
If the second list is not empty repeat step (2) producing a third list, keep repeating until no prefixes have been removed. E.g. our sample names produce the 3 lists:
Brando, Earl Jones, James Earl Jones, Marlon Brando
James Earl
James
Convert each list to a regular expression using alternation to produce a list of regular expressions to use in searching. E.g.:
"Brando|Earl Jones|James Earl Jones|Marlon Brando", "James Earl", "James"
(At this point we realise the sample names could have been better as only the first RE required alternation. Oh well...)
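The preparation steps so far (sort, peel off prefixes into further lists, build one regular expression per list) can be sketched in a few lines. This is a rough Python illustration rather than Objective-C or Swift, and the function names are invented for the example. It uses plain string prefixes, exactly as in the description, with no word-boundary handling.

```python
import re


def split_into_prefix_free_lists(names):
    """Sort the names, then repeatedly move any name that is a prefix of
    its immediately following name into the next list."""
    lists = []
    current = sorted(set(names))
    while current:
        kept, removed = [], []
        for i, name in enumerate(current):
            following = current[i + 1] if i + 1 < len(current) else ""
            if following.startswith(name):
                removed.append(name)
            else:
                kept.append(name)
        lists.append(kept)
        current = removed
    return lists


def to_regexes(lists):
    """One compiled RE per list, using alternation between the names."""
    return [re.compile("|".join(re.escape(n) for n in names)) for names in lists]


sample_names = ["James Earl Jones", "James", "Marlon Brando",
                "Earl Jones", "Brando", "James Earl"]
name_lists = split_into_prefix_free_lists(sample_names)
regexes = to_regexes(name_lists)
```

For the sample names this produces the same three lists as above. Because the names are sorted, checking only the immediately following name is enough: anything sorted between a name and a longer name that starts with it must also start with it.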
Now we are ready to use our prepared regular expressions to find the matches.
Set the search range to the whole text, the match range to be empty/no value.
Set the current RE to the first
Using the current RE do a search for the first match within the search range to produce a match range. If there is no new match goto (9). E.g. using our sample, where the match range is indicated by []'s:
James, James Earl and [James Earl Jones] all regular meet for coffee
Set the new search range to be from the start of the current search range to the end of the match range, advance the current RE, goto (6). E.g. the sequence of matches for the first name goes:
James, James Earl and [James Earl Jones] all regular meet for coffee
James, [James Earl] and James Earl Jones
[James], James Earl
We now have our first matching range, record it, set the new search range to be from the end of the matched range to the end of the text, and if this new search range is non-empty goto 6.
Done, we have the list of matches.
If you don't want the list of actual matches but just a collection of unique matches then accumulate a set (e.g. NSMutableSet/Set) of matches as you go.
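As a cross-check while coding the algorithm, here is a deliberately simpler variant, sketched in Python. Note that it is a different technique from the multi-list search described above: it puts all names into a single alternation, longest first, and relies on the regex engine trying alternatives in order at each position, so the longest name starting at a given position wins. The `\b` word boundaries and the case-insensitive flag are my own assumptions, and `find_names` is a made-up name.

```python
import re


def find_names(text, names):
    """Return the listed names found in text, in order of appearance."""
    # Longest first, so "James Earl Jones" is tried before "James Earl" and "James".
    ordered = sorted(set(names), key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Map each hit back to the spelling used in the name list.
    canonical = {n.lower(): n for n in ordered}
    return [canonical[m.group(0).lower()] for m in pattern.finditer(text)]


sample_names = ["James Earl Jones", "James", "Marlon Brando",
                "Earl Jones", "Brando", "James Earl"]
sample_text = "James, James Earl and James Earl Jones all regular meet for coffee"
matches = find_names(sample_text, sample_names)
unique_matches = set(matches)
```

On the sample text this gives James, James Earl, James Earl Jones, in that order. The same function can be run on the string and actor list from the question.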
Have fun coding (and refining, coding...) the algorithm. If you get stuck ask a new question, reference this Q&A, describe your algorithm as it is then, show your implementation, detail your problem, etc. and someone will undoubtedly help you along. HTH.

Core Data Find Single Match on Two Attributes

This use case should be reasonably common but I can't seem to figure it out or find any ways to solve it on the web. I would like to search an employee database using natural language, ie an unstructured word string.
Let's say you have a core data entity with three attributes as follows:
firstname string
lastname string
Employee_id string
And you have the following managed objects
Jane | Smith | 1
Jane | Smiley | 2
Jane | Doe | 3
John | Doe | 4
Richard | Doe | 5
How would you get the employee-id for the string 'Jane Doe' or 'Doe Jane' or 'Doe Jane accounting'?
If you knew for sure that there were just two words and the order was first name, last name, then you could do:
NSArray *words = [string componentsSeparatedByString:@" "];
NSPredicate *pred = [NSPredicate predicateWithFormat:@"firstname beginswith[cd] %@ AND lastname beginswith[cd] %@", words[0], words[1]];
But in this case, I don't know the order or number of words.
Thanks in advance for any suggestions.
You can use NSCompoundPredicate and put a bunch of NSPredicates with the various ordering possibilities like the one you made in your example, but that will obviously be limited to the number of words you want to write predicate combinations for.
(probably obviously, but you are creating a series of predicates like: ( (stringA and string B) or (stringB and stringA), or (stringA and stringC) or (stringC and stringA) or (stringB and stringC) or (stringC and stringB) ).
You can create these predicates relatively cleanly by writing a predicate with variables and then using predicateWithSubstitutionVariables: repeatedly with different dictionary of variable -> word mappings to get the various arrangements.
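To make the combinatorics concrete, here is a small Python sketch of the matching logic only. It is not Core Data code: the employees are a plain in-memory list taken from the question, and `find_employee_ids` is a hypothetical helper. Each ordered pair of words plays the role of one (firstname, lastname) predicate, and the pairs are OR-ed together.

```python
from itertools import permutations

# (firstname, lastname, Employee_id) rows from the question.
employees = [
    ("Jane", "Smith", "1"),
    ("Jane", "Smiley", "2"),
    ("Jane", "Doe", "3"),
    ("John", "Doe", "4"),
    ("Richard", "Doe", "5"),
]


def find_employee_ids(query, rows):
    """Try every ordered pair of words as (firstname, lastname) prefixes."""
    words = [w.lower() for w in query.split()]
    found = []
    for first, last in permutations(words, 2):
        for firstname, lastname, employee_id in rows:
            matches = (firstname.lower().startswith(first)
                       and lastname.lower().startswith(last))
            if matches and employee_id not in found:
                found.append(employee_id)
    return found


ids_plain = find_employee_ids("Jane Doe", employees)
ids_reversed = find_employee_ids("Doe Jane", employees)
ids_extra = find_employee_ids("Doe Jane accounting", employees)
```

All three queries from the question resolve to employee 3. The number of pairs grows as n(n-1) with the number of words, which is the limitation mentioned above.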
The trick is, at some point you are trying to do free form full text searching across structured data without a full text index. Here's a decent blog post (though old) on the challenges of doing that.
An alternative is to design your user interface to get the user to give you the data in a form that is easier to deal with. For example, give the user a form for their query with the valid fields to fill in. Or at least direct them with prompt text for your open entry single text field with something like "Enter the person's first and last name" or whatever makes sense.

Calculating self citation counts in DBLP using neo4j

I have imported the DBLP database with referenced publications from Crossref API into neo4j.
The goal is to calculate a self-citation-quotient for each author in the database.
The way I'd like to calculate this quotient is the following:
find authors that have written publications referencing another publication written by the same author
for each of these publications count the referenced publications written by the same author
divide amount of self references by the amount of all references
set this number as a parameter scq(self citation quotient) for the publication
sum all values of scq and divide them by the total amount of publications written by the author
set this value as a property scq for the Author
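To pin down the arithmetic, here is a minimal sketch of these steps in plain Python, working on in-memory dictionaries instead of the graph. The data layout and the function name are assumptions made for the example, and the reference from publication 1 to an external publication 9 is hypothetical, added only so that one quotient is not trivially 1.

```python
def self_citation_quotients(wrote, references):
    """wrote: author -> set of publication ids written by that author.
    references: publication id -> set of referenced publication ids."""
    result = {}
    for author, publications in wrote.items():
        total = 0.0
        for publication in publications:
            refs = references.get(publication, set())
            if refs:
                # scq of one publication: self references / all references.
                total += len(refs & publications) / len(refs)
        # scq of the author: sum of publication scq values / number of publications.
        result[author] = total / len(publications) if publications else 0.0
    return result


wrote = {"Danielle S. Bassett": {1, 2, 3, 4}}
references = {
    1: {2, 9},  # 9 is a hypothetical publication by someone else
    3: {4},
}
author_scq = self_citation_quotients(wrote, references)
```

Here publication 1 gets 1/2, publication 3 gets 1/1, and the author value is (0.5 + 1.0) / 4 = 0.375.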
As an example I have the following sub-graph for the author "Danielle S. Bassett":
From the graph you can see that she has 2 publications that contain self-references.
In Words:
Danielle wrote Publication 1, 2, 3, 4
Publication 1 references publication 2
Publication 3 references publication 4
My attempt was to use the following cypher query:
match (a:Author{name:"Danielle S. Bassett"})-[:WROTE]->(p1:Publication)-[r:REFERENCES]->(p2:Publication)<-[:WROTE]-(a)
with count(p2) as ssc_per_publ,
count(p1) as main_publ_count,
collect(p2) as self_citations,
collect(p1) as main_publ,
collect(r) as refs,
a as author
return author, main_publ, ssc_per_publ, self_citations, main_publ_count, refs
The result of this query as a table looks like this:
As you can see from the table the main_publ_count is calculated correctly since there are 2 publications she has written that contain self references but the ssc_per_publ (self citation count per publication) is wrong because it counted ALL self references. But I need the count of self references for EACH PUBLICATION.
Calculating the quotients will not be the problem but getting the right values from neo4j is.
I hope I've expressed myself clearly enough for you to understand the issue.
Maybe one of you knows a way of getting this right. Thanks!
Your WITH clause is using author as the sole aggregation function "grouping key", since it is the only term in that clause not using an aggregation function. So, all the aggregation functions in that clause are aggregating over just that one term.
To get a "self citation count" per publication (by that author), you'd have to do something like the following (for simplicity, this query ignores all the other counts and collections). author and publ together form the "grouping key" in this query.
MATCH (author:Author{name:"Danielle S. Bassett"})-[:WROTE]->
(publ:Publication)-[r:REFERENCES]->(p2:Publication)<-[:WROTE]-(author)
RETURN author, publ, COUNT(p2) as self_citation_count;
[Aside: your original query has other issues as well. For example, you should use COUNT(DISTINCT p1) as main_publ_count so that multiple self-citations to the same p1 instance will not inflate the count of "main" publications.]

Summary of Results Based on Members of File

I am not quite sure how to word this so I've also included some poorly formatted example :) Basically I have a report exported from Cognos. The report contains a list of cases and the people that are associated to those cases, along with additional information about their First Language and Religion (as an example). What I would like to do is create a summary and/or chart of the results based on the unique case.
Any ideas? Example data below:
Case Reference - Name - First Language - Religion
1234 - Name1 - English - Catholic
1234 - Name2 - French - Protestant
4321 - Name3 - Tamil - Unknown
3345 - Name4 - English - Hindu
So for a summary I'd like to see that for languages there is 1 for Tamil and 1 for French (English would be the default if no other languages are present - so for file 1234 it would have been English if there was no French speaking person). For religions I'd like to be able to see that out of the 3 files, 1 is unknown, 1 is Hindu and also that the 3rd file is actually 2 religions (Catholic and Protestant).
I am not sure if any of this is making sense but hopefully one of you can shed some light on a possible solution. I'd like to template it out so that on line one of the case it would have an x under each heading, but do it automatically instead of manually. Basically, for each unique case are there any members that are French, any that are Tamil, any that are Catholic, any that are Christian, etc...
Thanks!
I hope I'm following correctly. It seems you want to show for each language, how many cases they are associated with and for every case, how many religions are associated with it.
For language, add a column to your report's query called Language Count with the following expression:
count(distinct [Case Reference] for [First Language])
This will count the number of unique cases for each language.
For religions, add a column to your report's query called Religion Count with the following expression:
count(distinct [Religion] for [Case Reference])
This will count the number of unique religions for each case.
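If it helps to see what those two expressions are meant to compute, here is a small Python sketch on the example data from the question. It is only an illustration of "count distinct X for Y", not Cognos syntax, and `count_distinct` is a made-up helper.

```python
from collections import defaultdict

# (Case Reference, Name, First Language, Religion) rows from the question.
rows = [
    ("1234", "Name1", "English", "Catholic"),
    ("1234", "Name2", "French", "Protestant"),
    ("4321", "Name3", "Tamil", "Unknown"),
    ("3345", "Name4", "English", "Hindu"),
]


def count_distinct(data, value_index, group_index):
    """Number of distinct values in one column for each value of another."""
    groups = defaultdict(set)
    for row in data:
        groups[row[group_index]].add(row[value_index])
    return {group: len(values) for group, values in groups.items()}


# count(distinct [Case Reference] for [First Language])
language_count = count_distinct(rows, value_index=0, group_index=2)
# count(distinct [Religion] for [Case Reference])
religion_count = count_distinct(rows, value_index=3, group_index=0)
```

On this data English is linked to 2 cases, French and Tamil to 1 each, and case 1234 has 2 religions while the other two cases have 1.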

Kettle: ETL Normalization split string fields

I have a database in which one attribute actually carries two values (a string separated by "; "). Take a look at the following example:
Example
This is my Database A (Source) which has a table like this:
This all seems fine, but an author field can hold one or more entries, so you can have a record like this:
document (id 1, author "John Matt; Mary Knight", abstract "Lorem ipsum...", year 2015)
So what I intend to do in Database B (Target) is something like this:
where a_id from table Authors is a foreign key that references author_id on table Document.
First I need to insert all the authors (which is no problem), and then assign the group of authors to the respective document (which is the problem), because I can have this situation:
Authors (id 1, name "John Matt")
(id 2, name "John Matt")
(id 2, name "Mary Knight")
Notice that, following the previous example, id 2 is the one that would be inserted.
Question:
How can this be done with an ETL process in Kettle?
Is this good practice, or is a string attribute separated by "; " good enough?
If I have understood your question correctly, you have a database with rows like
document (id 1, author "John Matt; Mary Knight", abstract "Lorem ipsum...", year 2015)
Now you need to extract the multiple authors, which are separated by ;.
PDI provides a step called Split field to rows. Set the delimiter to ; and you will get one copy of the row per author. You can then use a unique id from the author table.
Alternatively, you can use the Modified Java Script Value step to split the rows. I recommend the first approach.
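To show what the split should produce, here is a rough Python sketch of the normalization itself. It is not a Kettle transformation. The second sample document, the `normalize` function and the idea of a separate document/author link list are assumptions for illustration; the question's target schema may differ.

```python
def normalize(documents):
    """Split the '; '-separated author field into one row per author.

    Returns (authors, links): authors maps each distinct name to a generated
    id, links holds one (document id, author id) pair per author of a document.
    """
    authors = {}
    links = []
    for document in documents:
        for raw_name in document["author"].split(";"):
            name = raw_name.strip()
            if not name:
                continue
            # Reuse the id if this author was already seen, else assign the next one.
            author_id = authors.setdefault(name, len(authors) + 1)
            links.append((document["id"], author_id))
    return authors, links


documents = [
    {"id": 1, "author": "John Matt; Mary Knight", "year": 2015},
    {"id": 2, "author": "John Matt", "year": 2015},  # hypothetical second record
]
authors, links = normalize(documents)
```

Each author is stored once, and a document with several authors simply gets several link rows, which avoids the duplicated "John Matt" entries shown in the question.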
Hope this helps :)
