For a school project, I've found myself playing with data from the Census Bureau's Current Population Survey. I've chosen SPSS to work with the data, because it seemed like the easiest piece of software to jump right into given my limited timeframe. Everything seems pretty straightforward, except for one operation that's giving me trouble.
For each case in my dataset--each case representing an individual surveyed--the following (relevant) variables are defined:
Household ID (HHID)--a number unique to each household surveyed
Person ID (PID)--a number unique to each person within the household
The person's age (AGE)
Whether or not the person received public health insurance--a 0 or 1 (HASHEALTH)
The person ID of the individual's father, if one exists in the household (0 if none exists) (POPNUM)
The person ID of the individual's mother, if one exists in the household (0 if none exists) (MOMNUM)
Here's the problem: I need to set the KIDHASHEALTH value of any given parent to the HASHEALTH value of the youngest person whose HHID matches the parent's HHID and whose POPNUM or MOMNUM matches the parent's PID--functionally, their youngest child.
So far, I've been unable to figure out how to do this using SPSS syntax. Can anybody think of a way to accomplish what I'm trying to do, with syntax or otherwise?
Many, many thanks in advance.
Edited with sample data:
HHID |PID |AGE |POPNUM |MOMNUM |HASHEALTH |KIDHASHEALTH
-----+----+----+-------+-------+----------+------------
1 |1 |45 |0 |0 |0 |0 //KIDHASHEALTH == 0 because
1 |2 |48 |0 |0 |0 |0 //youngest child's HASHEALTH == 0
1 |3 |13 |1 |2 |0 |0
2 |1 |33 |0 |0 |0 |1 // == 1 because youngest child's
2 |2 |28 |0 |0 |0 |1 // HASHEALTH == 1
2 |3 |15 |1 |2 |0 |0
2 |4 |12 |1 |2 |1 |0
-----+----+----+-------+-------+----------+------------
The code below was tested only on your small data snippet, so no guarantees for the full data and all its peculiarities. The code assumes that AGE is an integer.
*Let's add small fractional noise to the AGE of those children who have HASHEALTH=1.
*In order to insert the info about health right into the age number.
if hashealth age= age+rv.unif(-.1,+.1).
*Turn to fathers. Combine POPNUM and PID numbers in one column.
compute parent= popnum. /*Copy POPNUM as a new var PARENT.
if parent=0 parent= pid. /*and if the case is not a child, put their own PID there.
*Now a father and his children have the same code in PARENT
*and so we can propagate the minimal age in that group (which is the age of the
*youngest child, provided the man has children) to all cases of the group,
*including the father.
aggregate /outfile= * mode= addvari
/break= hhid parent /*breaking is done also by household, of course
/youngage1= min(age). /*The variable showing that minimal age.
*Turn to mothers and do the same thing.
compute parent= momnum.
if parent=0 parent= pid.
aggregate /outfile= * mode= addvari
/break= hhid parent
/youngage2= min(age). /*The variable showing that minimal age.
*Take the minimal value from the two passes.
compute youngage= min(youngage1,youngage2).
*Compute binary KIDHASHEALTH variable.
*Remember that YOUNGAGE is not integer if that child has HASHEALTH=1.
compute kidhashealth= 0.
if popnum=0 and momnum=0 /*if we deal with a parent
and age<>youngage /*and the youngest age listed is not their own
and rnd(youngage)<>youngage kidhashealth= 1. /*and the age isn't integer, assign 1.
compute age= rnd(age). /*Restore integer age
exec.
delete vari parent youngage1 youngage2 youngage.
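If you ever end up doing this outside SPSS, a rough pandas equivalent of the same idea would be to collect every child row once per parent link, keep the youngest child per (household, parent) pair, and merge that back onto the parents. Untested sketch; column names are taken from the question, everything else is an assumption:
import pandas as pd

# df has one row per person with columns HHID, PID, AGE, HASHEALTH, POPNUM, MOMNUM.
def add_kidhashealth(df):
    # One row per (household, parent) pair, built from the children's records.
    kids = pd.concat([
        df.loc[df["POPNUM"] > 0, ["HHID", "POPNUM", "AGE", "HASHEALTH"]]
          .rename(columns={"POPNUM": "PARENT_PID"}),
        df.loc[df["MOMNUM"] > 0, ["HHID", "MOMNUM", "AGE", "HASHEALTH"]]
          .rename(columns={"MOMNUM": "PARENT_PID"}),
    ])
    # For each parent, keep the HASHEALTH of the youngest child.
    youngest = (kids.sort_values("AGE")
                    .groupby(["HHID", "PARENT_PID"], as_index=False).first()
                    .rename(columns={"HASHEALTH": "KIDHASHEALTH"}))
    out = df.merge(youngest[["HHID", "PARENT_PID", "KIDHASHEALTH"]],
                   left_on=["HHID", "PID"], right_on=["HHID", "PARENT_PID"],
                   how="left")
    out["KIDHASHEALTH"] = out["KIDHASHEALTH"].fillna(0).astype(int)
    return out.drop(columns="PARENT_PID")
On the sample data above this should reproduce the expected KIDHASHEALTH column (1 for both parents in household 2, 0 for the parents in household 1).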
This might be quite a basic question, but:
Let's say I have a study where reaction time is measured twice before drinking alcohol and twice after drinking a specific amount of alcohol, and the hypothesis is that alcohol increases reaction time.
I have got my data in SPSS in the following format:
id | name| time_a | time_b | time_mean | time_a_alcohol | time_b_alcohol | time_mean_alcohol|
1| john| 0.17| 0.21| 0.19| 0.20| 0.24| 0.22|
2| bob| 0.15| 0.25| 0.20| 0.20| 0.30| 0.25|
I would like to do an independent-samples t-test, which I believe I could do if the data were set up as follows:
id | name| alcohol| time_a | time_b | time_mean|
1| john| 0| 0.17| 0.21| 0.19|
1| john| 1| 0.20| 0.24| 0.22|
2| bob| 0| 0.15| 0.25| 0.20|
2| bob| 1| 0.20| 0.30| 0.25|
where I could have alcohol as the grouping variable. However, my data isn't in that format right now, as all of it is in one row.
Is there a way in SPSS to keep everything in one row and still compare "time_mean" and "time_mean_alcohol" as groups without having to split them onto two different rows; if not, is there a simple script I could write to split the data?
You could calculate those means in the same row (and then run the analysis on them) like this:
compute time_mean=mean(time_a, time_b).
compute time_mean_alcohol=mean(time_a_alcohol, time_b_alcohol).
On the other hand, you can reach the long format as you described using this code:
varstocases /make time_a from time_a time_a_alcohol/make time_b from time_b time_b_alcohol/index=ind(time_a).
compute alcohol=char.index(ind, "alcohol")>0.
compute time_mean=mean(time_a, time_b).
exe.
NOTE: this looks to me like a case for a paired-samples test rather than an independent-samples one.
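For completeness, if the data ever lives in Python rather than SPSS, the same wide-to-long restructuring can be sketched with pandas (column names taken from the question; purely illustrative):
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "name": ["john", "bob"],
    "time_a": [0.17, 0.15],
    "time_b": [0.21, 0.25],
    "time_a_alcohol": [0.20, 0.20],
    "time_b_alcohol": [0.24, 0.30],
})

# Split the wide row into one row per person per condition.
no_alc = df[["id", "name", "time_a", "time_b"]].assign(alcohol=0)
alc = (df[["id", "name", "time_a_alcohol", "time_b_alcohol"]]
       .rename(columns={"time_a_alcohol": "time_a", "time_b_alcohol": "time_b"})
       .assign(alcohol=1))
long = pd.concat([no_alc, alc], ignore_index=True)
long["time_mean"] = long[["time_a", "time_b"]].mean(axis=1)
The same caveat applies: with each subject measured under both conditions, a paired test on the two per-subject means is the more natural analysis.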
I am using Jupyter Notebook to run Cypher queries via the py2neo and pygds libraries. I can run the Cypher queries via the Neo4j Browser too.
This command gives me the counts:
gds.run_cypher('MATCH u=(p:Nodelabel1 {property1: "Nodelabel2"})-[r:relationship1]->() WHERE r.integer > 10 RETURN Count (u)')
Output is:
| | Count (u) |
|0 | 526 |
I want to mark the nodes in this count as 1 and the ones which are not in this count (i.e. where r.integer <= 10) as 0, under Nodelabel1, with a new variable/property name of course.
I tried the two-rows method Nathan Smith shows below and got the counts for the two categories too. But when I try
gds.run_cypher('MATCH(u:Nodelabel1) RETURN u.category AS category, count(u) AS cnt')
I get:
| | category | cnt
|0 | None | 120711
I expect to get:
| | category | cnt
|0 | Less than or equal to 10 | 20733
|1 | Greater than 10 | 526
If you want the answer in two rows, do it this way:
gds.run_cypher("""
MATCH (p:Nodelabel1 {property1: "Nodelabel2"})-[r:relationship1]->()
RETURN
CASE WHEN r.integer > 10 THEN "Greater than 10"
ELSE "Less than or equal to 10" END AS category,
Count (*) AS count
""")
If you want the answer in two columns, do it this way:
gds.run_cypher("""
MATCH (p:Nodelabel1 {property1: "Nodelabel2"})-[r:relationship1]->()
WITH
SUM(CASE WHEN r.integer > 10 THEN 1
ELSE 0 END) AS greaterThanTen,
Count (*) AS total
RETURN greaterThanTen, total - greaterThanTen AS lessThanOrEqualToTen
""")
I have a table:
id| name | organisation_name|flag |priority|salary
1 | Mark | organisation 1 |null |1 |100.00
2 | Inna | organisation 1 |null |2 |400.00
3 | Marry| organisation 1 |null |3 |500.00
4 | null | organisation 1 |250.00|null |null
5 | Grey | organisation 2 |null |1 |600.00
6 | Holly| organisation 2 |null |2 |400.00
8 | null | organisation 2 |150.00|null    |null
The procedure should deduct the flag amount from the salaries of a particular organisation in priority order. The expected result for the table above is shown below.
Result:
id| name | organisation_name|flag |priority|salary
1 | Mark | organisation 1 |null |1 |0.00
2 | Inna | organisation 1 |null |2 |250.00
3 | Marry| organisation 1 |null |3 |500.00
4 | null | organisation 1 |250.00|null |null
5 | Grey | organisation 2 |null |1 |450.00
6 | Holly| organisation 2 |null |2 |400.00
8 | null | organisation 2 |150.00|null    |null
I created a PL/SQL block for this, but it is very slow on one million records.
What is the fastest way to do this?
No need for PL/SQL here. SQL has sufficient capabilities to do this and should be fast enough:
MERGE INTO orgs o
USING (SELECT o.id,
greatest(o.salary - greatest(0, f.flag - nvl(sum(o.salary) over (partition by o.organisation_name order by o.priority rows between unbounded preceding and 1 preceding), 0)), 0) as salary
FROM orgs o
LEFT JOIN (SELECT organisation_name, flag FROM orgs WHERE flag IS NOT NULL) f
ON (f.organisation_name = o.organisation_name)
WHERE o.priority IS NOT NULL) f
ON (f.id = o.id)
WHEN MATCHED THEN UPDATE SET o.salary = f.salary;
For clarification: the query in the USING clause computes the updated salaries, using a window function to calculate a running total of salaries within each organisation name and deducting only the part of the flag that has not already been absorbed by higher-priority rows. Then we merge the result into the original table.
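As a sanity check of the logic outside the database, the same running-total deduction can be sketched in a few lines of Python (purely illustrative; it assumes the rows of one organisation are already sorted by priority):
def deduct_flag(rows, flag):
    # rows: list of (priority, salary) tuples sorted by priority; flag: amount to deduct.
    remaining = flag
    result = []
    for priority, salary in rows:
        deducted = min(salary, remaining)   # take as much as possible from this row
        result.append((priority, salary - deducted))
        remaining -= deducted
    return result

# organisation 1: the flag of 250 wipes out 100, takes 150 from 400, leaves 500 untouched
print(deduct_flag([(1, 100.0), (2, 400.0), (3, 500.0)], 250.0))
# -> [(1, 0.0), (2, 250.0), (3, 500.0)]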
To speed up your existing code, try using BULK COLLECT and FORALL in your PL/SQL, or, as shown above, do it in plain SQL.
Say a user ordered the same product under two different order_ids.
The orders were created within the same date-hour granularity, for example:
order#1 2019-05-05 17:23:21
order#2 2019-05-05 17:33:21
In the data warehouse, should we put them into two rows like this (Option 1):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 1 |
| 002 | 1111 | 22 | 123 | 456 | 10 | 2 |
Or just put them in one row with the aggregated quantity (Option 2):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 3 |
I know if I put the order_id as a degenerate dimension in the fact table, it should be Option 1. But in our case, we don't really want to keep the order_id.
Also, I once read an article saying that when you filter on every dimension, there should be only one row left in the fact table. If that statement is correct, Option 2 would be the choice.
Is there a principle I can refer to?
Conceptually, fact tables in a data warehouse should be designed at the most detailed grain available. You can always aggregate data from the lower granularity to the higher one, while the opposite is not true - if you combine the records, some information is lost permanently. If you ever need it later (even though you might not see it now), you'll regret the decision.
I would recommend the following approach: in the data warehouse, keep the order number as a degenerate dimension. Then, when you publish a star schema, you might build a pre-aggregated version of the table (skip the order number, group identical records by date/hour). This way, you have a smaller/cleaner fact table in your dimensional model and yet preserve the more detailed data in the DW.
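If it helps to see the two grains side by side: the pre-aggregated table is just a group-by over the detailed one, never the other way around. Illustrative pandas sketch using the keys from the example:
import pandas as pd

# Option 1: detailed grain, one row per order line (order number dropped here for brevity).
detailed = pd.DataFrame({
    "user_key":    [1111, 1111],
    "product_key": [22, 22],
    "date_key":    [123, 123],
    "time_key":    [456, 456],
    "price":       [10, 10],
    "quantity":    [1, 2],
})

# Option 2 is derivable from Option 1 by summing the additive measure.
aggregated = (detailed
              .groupby(["user_key", "product_key", "date_key", "time_key", "price"],
                       as_index=False)["quantity"].sum())
print(aggregated)   # one row with quantity == 3; the original split into 1 + 2 is gone for good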
We are working on a record linkage project.
In simple terms, we are searching for a product in a database just by looking at the similarity of its description. It is a very interesting problem to solve, but the machine learning approach we have currently adopted is resulting in very low accuracy. If you can suggest a more lateral approach, it would help our project a lot.
Input description
+-----+----------------------------------------------+
| ID | description |
-+----|----------------------------------------------+
| 1 |delta t17267-ss ara 17 series shower trim ss |
| 2 |delta t14438 chrome lahara tub shower trim on |
| 3 |delta t14459 trinsic tub/shower trim |
| 4 |delta t17497 cp cassidy tub/shower trim only |
| 5 |delta t14497-rblhp cassidy tub & shower trim |
| 6 |delta t17497-ss cassidy 17 series tub/shower |
-+---------------------------------------------------+
Description in Database
+---+-----------------------------------------------------------------------------------------------------+
|ID | description |
----+-----------------------------------------------------------------------------------------------------+
| 1 | delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial |
| 2 | delta monitor 14 lahara® tub and shower trim 2 gpm 1 handle chrome plated residential |
| 3 | delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle chrome plated residential |
| 4 | delta monitor17 addison™ tub and shower trim 2 gpm 1 handle chrome plated domestic residential|
| 5 | delta monitor 14 cassidy™ tub and shower trim 2 gpm venetian bronze |
| 6 | delta monitor 17 addison™ tub and shower trim 2 gpm 1 handle stainless domestic residential |
+---+-----------------------------------------------------------------------------------------------------+
Background information
1. The records in the database are fundamentally very similar to each other, which is causing a huge issue.
2. There are around 2 million records in the database, but when we search for a specific manufacturer the search space gets reduced to a few hundred.
3. The record in "Input description" with ID 1 is the same as the record in "Description in Database" with ID 1 (we know that from manual inspection).
4. We use a random forest to train and predict.
Current approach
1. We tokenize the description.
2. Remove stopwords.
3. Add abbreviation information.
4. For each record pair, we calculate scores from different string metrics such as Jaccard, Sorensen-Dice, and cosine, and take the average of these scores (a small sketch of these token-set metrics is shown after the table below).
5. We then calculate the score for the manufacturer ID using the Jaro-Winkler metric.
6. So if there are 5 records of a manufacturer in "Input description" and 10 records for that manufacturer in the database, the total is 50 record pairs, i.e. 10 pairs per input record, which results in scores that are very close together. We take the top 4 record pairs from each set of 10; where more than one record pair has a similar score, we keep all of them.
7. We arrive at the following learning data set format.
|----------------------------------------------------------+---------------------------- +--------------+-----------+
|ISMatch | Description average score |manufacturer ID score| Jaccard score of description | sorensenDice | cosine(3) |
|-------------------------------------------------------------------------------------------------------------------
|1 | 1:0.19 | 2:0.88 |3:0.12 | 4:0.21 | 5:0.23 |
|0 | 1:0.14 |2:0.66 |3:0.08 | 4:0.16 | 5:0.17 |
|0 | 1:0.14 |2:0.68 |3:0.08 |4:0.15 | 5:0.19 |
|0 | 1:0.14 |2:0.58 |3:0.08 |4:0.16 | 5:0.16 |
|0 | 1:0.12 |2:0.55 |3:0.08 |4:0.14 | 5:0.14 |
|--------+--------------------------+----------------------+--------------------------------------------+-----------+
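For reference, the token-set metrics in step 4 can be computed along these lines (plain-Python sketch on whitespace tokens; the real tokenisation and abbreviation handling differ):
def jaccard(a, b):
    # Jaccard similarity between two whitespace-token sets.
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def sorensen_dice(a, b):
    # Sorensen-Dice similarity between two whitespace-token sets.
    a, b = set(a.split()), set(b.split())
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

pair = ("delta t17267-ss ara 17 series shower trim ss",
        "delta monitor17 ara shower trim 2 gpm 1 lever handle stainless commercial")
print(jaccard(*pair), sorensen_dice(*pair))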
We train on the dataset above. When we predict in real time using the same approach, the accuracy is very low.
Please suggest any alternative approach;
we planned to use TF-IDF, but an initial investigation suggests it may not improve the accuracy by a huge margin either.
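For reference, a minimal scikit-learn sketch of that TF-IDF idea (character n-grams plus cosine similarity over the reduced per-manufacturer search space; the parameters are illustrative, not tuned) would be:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

input_descriptions = [
    "delta t17267-ss ara 17 series shower trim ss",
    "delta t14438 chrome lahara tub shower trim on",
]
db_descriptions = [
    "delta monitor17 ara shower trim 2 gpm 1 lever handle stainless commercial",
    "delta monitor 14 lahara tub and shower trim 2 gpm 1 handle chrome plated residential",
]

# Character n-grams tend to be more forgiving than word tokens for model numbers
# and abbreviations (t17267-ss vs monitor17, etc.).
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
vectorizer.fit(input_descriptions + db_descriptions)

sims = cosine_similarity(vectorizer.transform(input_descriptions),
                         vectorizer.transform(db_descriptions))
# sims[i, j] is the similarity between input i and database record j;
# the best candidate for input i is sims[i].argmax().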