Kettle: ETL Normalization split string fields - normalization

I have a database in which one attribute actually carries two values (a string separated by "; "). Take a look at the following example:
Example
This is my Database A (Source) which has a table like this:
This all seems fine, but since a document can have one or more authors, you will have a record like this:
document (id 1, author "John Matt; Mary Knight", abstract "Lorem ipsum...", year 2015)
So what I intend to do on Database B (Target) is something like this:
where a_id from table Authors is a foreign key that references author_id on table Document.
First I need to make sure that all the authors are inserted (there is no problem with that), and then assign the group of authors to the respective document (which is the problem), because I can have this situation:
Authors (id 1, name "John Matt")
(id 2, name "John Matt")
(id 2, name "Mary Knight")
Notice that id 2 is the one that would be inserted for the previous example.
Question:
How can this procedure be done as an ETL process using Kettle?
Is this good practice, or is a string attribute separated by "; " good enough?

If I have understood your question correctly, you have a database with rows like
document (id 1, author "John Matt; Mary Knight", abstract "Lorem ipsum...", year 2015)
Now you need to extract the multiple authors, which are separated by ;.
PDI provides a step called Split field to rows. Use ; as the delimiter and you will get one copy of the row per author. Since the separator in your data is "; ", trim the resulting values so the second author does not start with a space. You can then look up each author's unique id in the author table.
Alternatively, you can use the Modified Java Script Value step to split the rows. I recommend the first step.
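To make the splitting logic concrete outside of PDI, here is a minimal Python sketch of what the split-and-lookup does. It is not Kettle code. The second document, the generated ids, and the (document id, author id) link list are assumptions made for illustration; the link list is a junction-table layout, which is not necessarily the schema the question describes.

```python
def normalize(documents):
    """Split each ";"-separated author field into author ids and link rows."""
    author_ids = {}  # author name -> generated id (hypothetical numbering)
    links = []       # (document id, author id) pairs, one per author
    for doc in documents:
        for name in doc["author"].split(";"):
            name = name.strip()  # the source separator is "; ", so trim
            if not name:
                continue
            # Reuse the id if the author was seen before, else assign the next one.
            author_id = author_ids.setdefault(name, len(author_ids) + 1)
            links.append((doc["id"], author_id))
    return author_ids, links


documents = [
    {"id": 1, "author": "John Matt; Mary Knight", "year": 2015},
    {"id": 2, "author": "John Matt", "year": 2014},  # hypothetical second row
]
authors, document_authors = normalize(documents)
```

Each author is stored once, and a document with several authors simply produces several link rows.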
Hope this helps :)

Reordering text in DOORS layout column using DXL

I have seen this question asked for numbers, but my layout column consists of strings of text. There is no inherent order to the strings, and the possible values for the attribute connected to an object could be, for example, "apple", "orange", "banana", or "kiwi". The column I want looks for in-links from another module, and each in-link can have multiple values for the attribute in question. Ultimately I want the values to be ordered "orange", "banana", "kiwi", "apple", depending on which values the linked objects have. For example, if the linked object contains all 4, you would get the full ordered list. If it only has banana and apple, the column would return "banana", "apple". Sorry I don't have a code sample. At this point it would just be the stock layout column DXL though. Thanks for any help.
If your real world is really as simple as your example, it might be sufficient to just have a sequence of if statements, one per value in the desired order, like (pseudocode)
if linked_values contains "orange"
display "orange\n"
if linked_values contains "banana"
display "banana\n"
and you have a nice, sorted list of values.
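The same idea can be sketched in Python rather than DXL: walk the desired order once and keep only the values that are present. The names DISPLAY_ORDER and ordered_values are made up for this illustration.

```python
# Desired display order, taken from the question.
DISPLAY_ORDER = ["orange", "banana", "kiwi", "apple"]


def ordered_values(linked_values):
    """Return the linked values that are present, in the fixed display order."""
    present = set(linked_values)
    return [fruit for fruit in DISPLAY_ORDER if fruit in present]


# One value per line, as a layout column would display them.
column_text = "\n".join(ordered_values(["apple", "banana"]))
```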
If not, you need real sorting.
Sorting in DXL is usually done using skip lists. When you iterate over a skip list, you get the values in the order of the sorted keys (note that keys are unique, there cannot be two objects with the same key in a skip list).
So, your task would be to create a temporary skip list and a mapping that calculates, for each entry to be stored, a key that represents the correct order.
If I understand your example correctly, you would have a mapping
orange: a
banana: b
kiwi: c
apple: d
Let's assume that there may be multiple oranges per object and you want to list all of them, because you do not only want to display the fruit but also some attribute like size or quality. In this case, you would create sort keys like this:
Object 1 has linked objects with the values: first apple (big), second apple (small), kiwi (medium), third apple (big), orange. This would make the following skip list:
key: d001, value: apple (big)
key: d002, value: apple (small)
key: c003, value: kiwi (medium)
key: d004, value: apple (big)
key: a005, value: orange
If you want to sort first by fruit, then by size, and you code your sizes by a: big, b: medium, c: small, d: undefined, you would have keys like:
da001
dc002
cb003
da004
ad005
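The key scheme can be checked with a short Python sketch. It is not DXL: a plain dict sorted by key stands in for the skip list, and the names FRUIT_RANK, SIZE_RANK and sort_key are made up for this illustration.

```python
# Letter codes from the answer: fruit order first, then size.
FRUIT_RANK = {"orange": "a", "banana": "b", "kiwi": "c", "apple": "d"}
SIZE_RANK = {"big": "a", "medium": "b", "small": "c", None: "d"}  # None = undefined


def sort_key(position, fruit, size=None):
    """Build a unique key: fruit code, size code, zero-padded position."""
    return f"{FRUIT_RANK[fruit]}{SIZE_RANK[size]}{position:03d}"


linked = [
    ("apple", "big"),
    ("apple", "small"),
    ("kiwi", "medium"),
    ("apple", "big"),
    ("orange", None),
]

# The position suffix keeps keys unique even for identical fruit and size.
entries = {
    sort_key(i, fruit, size): (fruit, size)
    for i, (fruit, size) in enumerate(linked, start=1)
}

# Iterating the keys in sorted order mimics iterating a skip list.
ordered = [entries[key] for key in sorted(entries)]
```

Because the position is part of every key, two big apples get distinct keys (da001 and da004) and both survive in the result.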

Arrays of arrays or relationships in neo4j

The question came up after reading Natural Language Analytics made simple and visual with Neo4j blog entry created by Michael Hunger
When a word is used by more than one sentence (or more than one time in the same sentence), this word will have two or more [NEXT] relationships. In order to know the correct path for each sentence we need to store the segment id and the position id [sid,idx]
Storing one instance is clear: it creates an array with two values. But how do we add two or more arrays? As far as I know, neo4j only accepts basic data types.
Instead of using this solution, would it make sense to store one [NEXT] relationship for each sentence path? Of course, this would generate a very large number of relationships.
Thanks
NOTE: In the referenced article, there is a typo on the last line of the query in the "I also want to sentence number and word position" section. That is, r.pos = r.pos = [sid,idx] should be: r.pos = r.pos + [sid,idx].
When you use the + operator on 2 collections, you end up with a single collection that merges the contents of the 2 original collections. So, if r.pos starts out as [1, 2], then r.pos + [3, 4] will produce: [1, 2, 3, 4].
Therefore, the article does not have an "array of arrays" problem.
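Python lists concatenate the same way, so the behaviour is easy to try out. The sketch below also shows one way to read the flat collection back as [sid, idx] pairs; the helper name pairs is made up, and it assumes every append added exactly two values.

```python
def pairs(flat):
    """Group a flat [sid, idx, sid, idx, ...] list into (sid, idx) tuples."""
    return [tuple(flat[i:i + 2]) for i in range(0, len(flat), 2)]


pos = [1, 2]
pos = pos + [3, 4]  # concatenation merges into one flat list, no nesting

positions = pairs(pos)
```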

Cleaning data in SPSS with name misspellings

I have a 5M records dataset in this basic format:
FName LName UniqueID DOB
John Smith 987678 10/08/1976
John Smith 987678 10/08/1976
Mary Martin 567834 2/08/1980
John Smit 987678 10/08/1976
Mary Martin 768987 2/08/1980
The DOB is always unique, but I have cases where:
the same ID has different name spellings, or the same name has different IDs.
I got as far as making SPSS recognize that John Smit and John Smith with the same DOB are the same person, and I used aggregate to show how many times each spelling was used next to the name (John Smith, 10; John Smit, 5).
Case 1:
What I would like to do is to loop through all the records for the people identified to be the same person, and get the most common spelling of the person's name and use that as their standard name.
Case 2:
If I have multiple IDs for the same person, take the lowest one and make that the standard.
I am comfortable using basic syntax to clean my data, but this is the only thing that I'm stuck on.
If UniqueID is a real unique ID of individuals in the population and you want to find variations of name spellings (within groupings of these IDs) and assign the modal occurrence, then something like this would work:
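* Build one combined name field.
STRING FirstLastName (A99).
COMPUTE FirstLastName = CONCAT(RTRIM(FName), " ", LName).
* Count each spelling per ID, then find the highest count per ID.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES /BREAK=UniqueID FirstLastName /Count=N.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES /BREAK=UniqueID /MaxCount=MAX(Count).
* Blank out the less common spellings (a string cannot be set to $SYSMIS).
IF (Count<>MaxCount) FirstLastName = "".
* Spread the remaining spelling to every record with the same ID.
AGGREGATE OUTFILE= * MODE=ADDVARIABLES OVERWRITE=YES /BREAK=UniqueID /FirstLastName=MAX(FirstLastName).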
You could then overwrite the FName and LName fields as well, but more assumptions would have to be made if, for example, FName or LName can contain space characters, etc.
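For readers who want to check the logic outside SPSS, here is a rough Python equivalent of the modal-spelling step, using the sample rows from the question. The function name standard_names is made up. One difference to be aware of: on a tie, Counter.most_common returns the spelling seen first, whereas the MAX aggregate above would keep the alphabetically last one.

```python
from collections import Counter, defaultdict


def standard_names(records):
    """Map each UniqueID to its most frequent "First Last" spelling."""
    spellings = defaultdict(Counter)
    for first, last, unique_id in records:
        spellings[unique_id][f"{first} {last}"] += 1
    # most_common(1) yields [(spelling, count)] for the top spelling.
    return {uid: counts.most_common(1)[0][0] for uid, counts in spellings.items()}


records = [
    ("John", "Smith", 987678),
    ("John", "Smith", 987678),
    ("Mary", "Martin", 567834),
    ("John", "Smit", 987678),
    ("Mary", "Martin", 768987),
]
names = standard_names(records)
```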

How to sort a list of 1million records by the first letter of the title

I have a table with 1 million+ records that contain names. I would like to be able to sort the list by the first letter in the name.
.. ABCDEFGHIJKLMNOPQRSTUVWXYZ
What is the most efficient way to set up the db table to allow for searching by the first character in the table.name field?
The best idea I have right now is to add an extra field that stores the first character of the name (filled in via an observer), index that field, and then sort by it. The problem is that the result is no longer necessarily alphabetical.
Any suggestions?
You said in a comment:
so lets ignore the first letter part. How can I all records that start with A? All A's no B...z ? Thanks – AnApprentice Feb 21 at 15:30
I assume you meant "How can I RETURN all records..."
This is the answer:
select * from t
where substr(name, 1, 1) = 'A'
I agree with the comments above questioning why you would want to do this: a regular index on the whole field is functionally equivalent. PostgreSQL has some rather powerful indexing capabilities for special cases (with some new ones in version 9), which you might want to read about here: http://www.postgresql.org/docs/9.1/interactive/sql-createindex.html
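As an illustration of why an ordinary index on the whole column can serve this query, the sketch below rewrites the first-letter test as a range condition on name. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table contents and the index name t_name_idx are made up. Whether a particular database actually uses the index for this predicate depends on its planner and collation settings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("CREATE INDEX t_name_idx ON t (name)")  # plain index on the full column
conn.executemany(
    "INSERT INTO t (name) VALUES (?)",
    [("Alice",), ("Bob",), ("Anna",), ("Zed",), ("Aaron",)],
)


def names_starting_with(letter):
    """Return names whose first character is `letter`, in alphabetical order."""
    upper_bound = chr(ord(letter) + 1)  # 'A' -> 'B'
    rows = conn.execute(
        "SELECT name FROM t WHERE name >= ? AND name < ? ORDER BY name",
        (letter, upper_bound),
    )
    return [name for (name,) in rows]


a_names = names_starting_with("A")
```

The range form returns the rows already sorted by the full name, which addresses the concern that sorting on a first-letter field alone is not alphabetical. The comparison here is case-sensitive, so lowercase names would need separate handling.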

Help me to get better understanding of Digg's Cassandra data model

http://about.digg.com/blog/looking-future-cassandra
I've found this article about Digg's move to Cassandra, but I didn't get the author's idea of a bucket for each (user, item) pair. A few more details on the idea would help me understand the solution better.
Thanks
It sounds like they are using one row in a super column family per user with one super column per item; a subcolumn for an item super column represents a friend who dugg the item. At least in pycassa, this makes an insert as simple as:
column_family.insert(user, {item: {friend: ''}})
They could also have done this a couple of other ways, and I'm not sure which they chose.
One is to use a standard column family, use a (user,item) combination for the row key, and use one column per friend who dugg the item:
column_family.insert(user + item, {friend: ''})
Another is to use a standard column family, use just (user) for the row key, and use an (item, friend) combination for the column name:
column_family.insert(user, {item + friend: ''})
Doesn't sound like this is what they used, but it's an acceptable option as well.
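To compare the three layouts side by side, here is a sketch that models each one with plain Python dicts instead of a live Cassandra cluster. The function names and sample data are made up, and tuples stand in for the concatenated keys (user + item, item + friend) so that the boundaries between the parts stay unambiguous.

```python
def super_column_layout(diggs):
    """Row per user, super column per item, subcolumn per friend."""
    rows = {}
    for user, item, friend in diggs:
        rows.setdefault(user, {}).setdefault(item, {})[friend] = ""
    return rows


def composite_row_key_layout(diggs):
    """Row per (user, item) pair, column per friend."""
    rows = {}
    for user, item, friend in diggs:
        rows.setdefault((user, item), {})[friend] = ""
    return rows


def composite_column_name_layout(diggs):
    """Row per user, column per (item, friend) pair."""
    rows = {}
    for user, item, friend in diggs:
        rows.setdefault(user, {})[(item, friend)] = ""
    return rows


# Hypothetical data: (user, item, friend who dugg the item).
diggs = [
    ("alice", "story1", "bob"),
    ("alice", "story1", "carol"),
    ("alice", "story2", "bob"),
]
```

In all three layouts the column value is empty: the information is carried entirely by the row keys and column names.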
