I am trying to run descriptives (Means/frequencies) on my data that are in long format/repeated measures. So for example, for 1 participant I have:
Participant   Age
ID 1          25
ID 1          25
ID 1          25
ID 1          25
ID 2          30   (second participant, etc.)
So SPSS reads that as an N of 5 and uses that to compute the mean. I want SPSS to ignore repeated cases (Only read ID 1 data as one person, ignore the other 3). How do I do this?
Assuming the ages are always identical for all occurrences of the same ID - what you should do is aggregate (Data => aggregate) your data into a separate dataset, in which you'll take only the first age for each ID. Then you can analyse the age in the new dataset with no repetitions.
You can use this syntax:
DATASET DECLARE OneLinePerID.
AGGREGATE /OUTFILE='OneLinePerID' /BREAK=ID /age=first(age) .
dataset activate OneLinePerID.
means age.
I am building an MS Access DB to manage part numbers of mixtures. It's pretty much a bill of materials. I have a table, tblMixtures, that references itself in the PreMixtures field. I set this up so that a mixture can be a pre-mixture in another mixture, which can in turn be a pre-mixture in another mixture, etc. Each PartNumber in tblMixtures is related to many Components in tblMixtureComponents by the PartNumber. The Components and their associated data are stored in tblComponentData. I have put example data in the tables below.
tblMixtures

PartNumber  Description  PreMixtures
1           Mixture 1    4, 5
2           Mixture 2    4, 6
3           Mixture 3
4           Mixture 4    3
5           Mixture 5
6           Mixture 6
tblMixtureComponents

ID  PartNumber  Component  Concentration
1   1           A          20%
2   1           B          40%
3   1           C          40%
4   2           A          40%
5   2           B          30%
6   2           D          30%
tblComponentData

ID  Name  Density  Category
1   A     1.5      O
2   B     2        F
3   C     2.5      I
4   D     1        F
I have built the queries needed to pull the information together for the final mixture and even display the details of the pre-mixtures and components used for each mixture. However, with literally tens of thousands of part numbers, there can be a lot of overlap in pre-mixtures used for mixtures. In other words, Mixture 4 can be used as a pre-mixture for Mixture 1, Mixture 2, and a lot more. I want to build a query that will identify all possible mixtures that can be used as a pre-mixture in a selected mixture. So I want a list of all the mixtures that have the same components as, or a subset of the components of, the selected mixture. The pre-mixture doesn't have to have all the components in the mixture, but it can't have any components that are not in the mixture.
If you haven't solved it yet...
The PreMixtures column storing a collection of data is a sign that you need to "Normalize" your database design a little more. If you are going to be getting premixture data from a query then you do not need to store this as table data. If you did, you would be forced to update the premix data every time your mixtures or components changed.
Also, we need to address that tblMixtures doesn't have an id field. Consider the following table changes:
tblMixture:

id  description
1   Mixture 1
2   Mixture 2
3   Mixture 3
tblMixtureComponent:

id  mixtureId  componentId
1   1          A
2   1          B
3   1          C
4   2          A
5   2          B
6   2          D
7   3          A
8   4          B
I personally like to use column naming that exposes primary-to-foreign key relationships. tblMixture.id is clearly related to tblMixtureComponent.mixtureId. I am lazy, so I would also probably abbreviate everything too.
Now as far as the query, first let's get the components of mixture 1:
SELECT tblMixtureComponent.mixtureId, tblMixtureComponent.componentId
FROM tblMixtureComponent
WHERE tblMixtureComponent.mixtureId = 1
Should return:
mixtureId  componentId
1          A
1          B
1          C
We could change the WHERE clause to the id of any mixture we wanted. Next we need to get all the mixture ids that contain bad components (components not in mixture 1). So we will wrap a join around the last query:
SELECT tblMixtureComponent.mixtureId
FROM tblMixtureComponent LEFT JOIN
    (SELECT tblMixtureComponent.mixtureId,
            tblMixtureComponent.componentId
     FROM tblMixtureComponent
     WHERE tblMixtureComponent.mixtureId = 1) AS GoodComp
ON tblMixtureComponent.componentId = GoodComp.componentId
WHERE GoodComp.componentId Is Null
Should return:
mixtureId
2
Great, so now we have the ids of all the mixtures we don't want. Let's add another join to get the inverse:
SELECT tblMixture.id
FROM tblMixture LEFT JOIN
    (SELECT tblMixtureComponent.mixtureId
     FROM tblMixtureComponent LEFT JOIN
         (SELECT tblMixtureComponent.mixtureId,
                 tblMixtureComponent.componentId
          FROM tblMixtureComponent
          WHERE tblMixtureComponent.mixtureId = 1) AS GoodComp
     ON tblMixtureComponent.componentId = GoodComp.componentId
     WHERE GoodComp.componentId Is Null) AS BadMix
ON tblMixture.id = BadMix.mixtureId
WHERE BadMix.mixtureId Is Null AND tblMixture.id <> 1
Should return:
id
3
4
What's left is all the ids of mixtures whose components are a subset of mixture 1's components, i.e. they contain nothing that mixture 1 lacks.
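If you want to pick the mixture at run time instead of hard-coding mixture 1, the same query can be wrapped as a parameterized Access query. This is only a sketch; the parameter name SelectedMixtureId is illustrative:

PARAMETERS SelectedMixtureId LONG;
SELECT tblMixture.id
FROM tblMixture LEFT JOIN
    (SELECT tblMixtureComponent.mixtureId
     FROM tblMixtureComponent LEFT JOIN
         (SELECT tblMixtureComponent.mixtureId,
                 tblMixtureComponent.componentId
          FROM tblMixtureComponent
          WHERE tblMixtureComponent.mixtureId = SelectedMixtureId) AS GoodComp
     ON tblMixtureComponent.componentId = GoodComp.componentId
     WHERE GoodComp.componentId Is Null) AS BadMix
ON tblMixture.id = BadMix.mixtureId
WHERE BadMix.mixtureId Is Null AND tblMixture.id <> SelectedMixtureId;

Access prompts for SelectedMixtureId when the query runs, or you can feed it from a form or VBA.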
Sorry, I did this on a phone...
I have a Stream Analytics job with
INPUTS:
1) "InputStreamCSV" - linked to Event hub and recievies data . InputStreamHistory
2) "InputStreamHistory" - Input stream linked BlobStorage. InputStreamCSV
OUTPUTS:
1) "AlertOUT" - linked to table storage and inserts alarm event as row in table
I want to calculate AVERAGE amount for all transactions for year 2018(one number - 5,2) and compare it with transaction, that is comming in 2019:
If new transaction amount is bigger than average - put that transaction in "AlertOUT" output.
I am calculating average as :
SELECT AVG(Amount) AS TresholdAmount
FROM InputStreamHistory
group by TumblingWindow(minute, 1)
Receiving the new transaction as:
SELECT * INTO AlertOUT FROM InputStreamCSV TIMESTAMP BY EventTime
How can I combine these 2 queries to check whether a new transaction amount is bigger than the average transaction amount for last year?
You can use the JOIN operator in ASA SQL; the query below shows one way to combine the two queries.
WITH
t2 AS
(
    SELECT AVG(Amount) AS TresholdAmount
    FROM InputStreamHistory
    GROUP BY TumblingWindow(minute, 1)
)
SELECT t2.TresholdAmount
FROM InputStreamCSV t1 TIMESTAMP BY EventTime
JOIN t2
    ON DATEDIFF(minute, t1, t2) BETWEEN 0 AND 5
WHERE t1.Amount > t2.TresholdAmount
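To actually route the alert to the "AlertOUT" output rather than just selecting the threshold, a hedged extension of the same pattern (using the inputs and fields from the question) could look like this:

WITH
t2 AS
(
    SELECT AVG(Amount) AS TresholdAmount
    FROM InputStreamHistory
    GROUP BY TumblingWindow(minute, 1)
)
-- write only transactions whose Amount exceeds the averaged threshold
SELECT t1.EventTime, t1.Amount
INTO AlertOUT
FROM InputStreamCSV t1 TIMESTAMP BY EventTime
JOIN t2
    ON DATEDIFF(minute, t1, t2) BETWEEN 0 AND 5
WHERE t1.Amount > t2.TresholdAmount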
If the history data is stable, you could also join the history data as reference data. Please refer to the official sample.
If you are comparing last year's average with the current stream, it would be better to use reference data. Compute the averages for 2018, using either ASA itself or a different query engine, and write them to a storage blob. After that you can use the blob as reference data in the ASA query - it will replace the average computation in your example.
After that you can do a reference data join with InputStreamCSV to produce alerts.
Even if you would like to update the averages once in a while, the above pattern would work. Based on the refresh frequency, you can either use another ASA job or a batch analytics solution.
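A rough sketch of what that reference-data join might look like, assuming a reference data input named HistoricalAverages (a hypothetical name) backed by a blob with one row per year, e.g. Year = 2018 and TresholdAmount = 5.2:

WITH txn AS
(
    -- project the previous calendar year so it can be matched against the reference row
    SELECT EventTime,
           Amount,
           DATEPART(year, EventTime) - 1 AS PreviousYear
    FROM InputStreamCSV TIMESTAMP BY EventTime
)
SELECT txn.EventTime, txn.Amount
INTO AlertOUT
FROM txn
JOIN HistoricalAverages ref
    ON txn.PreviousYear = ref.Year      -- reference data joins need no DATEDIFF
WHERE txn.Amount > ref.TresholdAmount

Refreshing the blob whenever the averages are recomputed is enough; the query itself does not change.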
Is there an efficient way in SAS to verify whether a join you ran was a 1-to-1 or a 1-to-many join? I often work with tables that do not have a clear unique identifier, which has led to me running 1-to-many joins thinking they were 1-to-1, thus messing up my analysis.
In the simple case where I'm expecting the input datasets for a merge to be unique by some key, I will often code a simple assertion into the merge that throws an error if any duplicates are found:
Sample data:
data one;
do id=1,2,3;
output;
end;
run;
data two;
do id=1,2,2,3,4,4;
output;
end;
run;
Log:
16 data want;
17 merge one two;
18 by id;
19 if not (first.id and last.id) then put "ERROR: duplicates!" id=;
20 run;
ERROR: duplicates!id=2
ERROR: duplicates!id=2
ERROR: duplicates!id=4
ERROR: duplicates!id=4
NOTE: There were 3 observations read from the data set WORK.ONE.
NOTE: There were 6 observations read from the data set WORK.TWO.
NOTE: The data set WORK.WANT has 6 observations and 1 variables.
That doesn't tell you which dataset has duplicates (for that you need to use in= variables like Tom's answer), but it's an easy safety net to catch duplicates.
You can also just check your output dataset for duplicates after the merge, e.g.
data _null_;
set want (keep=id);
by id;
if not (first.id and last.id) then put "ERROR: Duplicate ! " id=;
run;
Duplicates are dangerous.
You can use the IN= flags, but you need to clear them.
Let's make some sample datasets.
data one;
do id=1,2,2,3;
output;
end;
run;
data two;
do id=1,1,2,2,3,3;
output;
end;
run;
Now merge them by ID. Clear the IN= variables before the MERGE statement so that the flag is not carried forward on the dataset with just a single observation.
data want ;
call missing(in1,in2);
merge one(in=in1) two (in=in2);
by id;
if not first.id and sum(of in1-in2)> 1 then put 'Multiple Merge: ' (_n_ id in1 in2) (=);
run;
Results in the LOG.
Multiple Merge: _N_=4 id=2 in1=1 in2=1
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 6 observations read from the data set WORK.TWO.
NOTE: The data set WORK.WANT has 6 observations and 1 variables.
Checking before merging is a better idea... Here are two nice and easy ways to do it. (Supposing we have a dataset named one with column id to be used for the merge).
Identify duplicate id's with PROC FREQ
proc freq data = one noprint;
table id /out = freqs_id_one(where=(count>1));
run;
Sort the dataset using NODUPKEY
...redirecting duplicate ids into a separate dataset:
proc sort data=one nodupkey out=one_nodupids dupout=one_dupids;
by id;
run;
Checking after-the-fact
If you realize too late that you didn't check for dupes (doh!), you can obtain the frequencies of the id with PROC FREQ (same code as above) or with a PROC SQL query:
proc sql;
select id,
count(id) as count
from merged_dataset
group by id
having count > 1;
quit;
I need to create a dimensional environment for sales analysis for a retail company.
The hierarchy that will be present in my Sales fact is:
1 - Country
1.1 - Region
1.1.1 - State
1.1.1.1 - City
1.1.1.1.1 - Neighbourhood
1.1.1.1.1.1 - Store
1.1.1.1.1.1.1 - Section
1.1.1.1.1.1.1.1 - Category
1.1.1.1.1.1.1.1.1 - Subcategory
1.1.1.1.1.1.1.1.1.1 - Product
Metrics such as Number of Sales, Revenue and Medium Ticket (Revenue / Number of Sales) make sense up to the Subcategory level, because if I reach the Product level the aggregation composition will need to change (I guess).
Also, metrics such as Productivity, which is Revenue / Number of Store Staff, won't make sense in this fact table, because it only works up to the Store level (also, I guess).
I'd like to know the best way to resolve this, because all of these metrics are about Sales, but some make sense only up to a specific level of my hierarchy and others don't.
Waiting for the reply, and thanks in advance!
You should split your hierarchy into 2 dimensions, Stores and Products.
The Stores dimension is all about the location of the sale, and you can put the number of employees in this dimension:
Store_Key  STORE   Neighbourhood  City  Country  Num_Staff
1          Store1  4th Street     LA    US       10
2          Store2  Main Street    NY    US       2
The Products dimension looks like:
Product_Key  Prod_Name      SubCat    Category   Unit_Cost
1            Cheese Sticks  Dairy     Food       $2.00
2            Timer          Software  Computing  $25.00
Then your fact table has a record for each Sale, and is keyed to the above dimensions:
Store_Key  Product_Key  Date       Quantity  Tot_Amount
1          1            31/7/2014  5         $10.00   (Store1 sells 5 cheese)
1          2            31/7/2014  1         $25.00   (Store1 sells 1 timer)
2          1            31/7/2014  3         $6.00    (Store2 sells 3 cheese)
2          2            31/7/2014  1         $25.00   (Store2 sells 1 timer)
Now that your data is in place, you can use your reporting tool to get the measures you need. Example SQL is something like the below:
SELECT store.STORE,
       SUM(fact.Tot_Amount) AS revenue,
       COUNT(*) AS num_sales,
       SUM(fact.Tot_Amount) / store.Num_Staff AS Productivity
FROM tbl_Store store, tb_Fact fact
WHERE fact.Store_Key = store.Store_Key
GROUP BY store.STORE, store.Num_Staff
should return the following result
STORE   revenue  num_sales  Productivity
Store1  $35.00   2          3.5
Store2  $31.00   2          15.5
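The same pattern rolls up along the product hierarchy too; here is a hedged sketch that groups by category and subcategory (the Products dimension table name tbl_Product is an assumption, the columns are the ones shown above):

SELECT prod.Category,
       prod.SubCat,
       SUM(fact.Tot_Amount) AS revenue,
       COUNT(*) AS num_sales,
       SUM(fact.Tot_Amount) / COUNT(*) AS medium_ticket   -- "Medium Ticket" = Revenue / Number of Sales
FROM tbl_Product prod, tb_Fact fact
WHERE fact.Product_Key = prod.Product_Key
GROUP BY prod.Category, prod.SubCat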
In my ETL process I am using Change Data Capture (CDC) to discover only rows that have been changed in the source tables since the last extraction. Then I do the transformation only for these rows. The problem is when I have, for example, 2 tables which I want to join into one dimension, and only one of them has changed. For example I have the tables Countries and Towns as follows:
Countries:
ID  Name
1   France

Towns:
ID  Name  Country_ID
1   Lyon  1
Now let's say a new row is added to the Towns table:
ID  Name   Country_ID
1   Lyon   1
2   Paris  2
The Countries table has not been changed, so CDC for these tables shows me only the row from Towns table. The problem is when I do the join between Countries and Towns, there is no row in Countries change set, so the join will result in empty set.
Do you have an idea how to solve this? Of course there might be more difficult cases, involving 3 or more tables and chains of joins.
This is a typical problem found when doing real-time Change Data Capture, or even incremental-only daily changes.
There are multiple ways to solve this.
One way would be to do your joins on the natural keys in the dimension or mapping table, to get the associated country (SELECT distinct country_name, [..other attributes..] from dim_table where country_id = X).
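A minimal sketch of that first approach, assuming the CDC change set for towns is staged in a table called stg_towns_changes (an illustrative name) and the country dimension dim_table from above is already loaded:

SELECT t.ID   AS town_id,
       t.Name AS town_name,
       c.country_name
FROM stg_towns_changes t
JOIN dim_table c
    ON c.country_id = t.Country_ID   -- look up the unchanged parent in the loaded dimension, not in the Countries change set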
Another alternative would be to do the join as part of the change capture process - when a row is loaded to towns, a trigger goes off that loads the foreign key values into the associated staging tables (country, etc).
There is a lot I could babble on about, but I will be specific to what is in your question. I would suggest the following to get the results...
1st pass: everything that matches via the join...
UNION ALL
2nd pass: all towns where there isn't a matching country
(a left outer join with a WHERE condition that requires the ID in the Countries table to be null/missing).
You would default the Country ID value in that unmatched join to something designated as an "Unmatched Value". Typically 0 or -1 is used, or a series of standard negative numbers that you can assign descriptions to later to identify why the data is bad; in your example, -1 could mean "Found Town Without Country". A sketch of the idea is below.
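A rough SQL sketch of those two passes, using the Towns and Countries tables from the question and -1 as the example "Unmatched Value":

SELECT t.ID, t.Name, t.Country_ID
FROM Towns t
INNER JOIN Countries c ON c.ID = t.Country_ID        -- 1st pass: the country is present

UNION ALL

SELECT t.ID, t.Name, -1 AS Country_ID                -- 2nd pass: default to the unmatched key
FROM Towns t
LEFT OUTER JOIN Countries c ON c.ID = t.Country_ID
WHERE c.ID IS NULL                                    -- "Found Town Without Country"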