I've got a dataset with repeated measures that looks roughly like this:
ID v1 v2 v3 v4
1 3 4 2 NA
1 2 NA 6 7
2 4 3 6 4
2 NA 2 7 9
. . . . .
n . . . .
What I want to know is: how many NAs are there for each participant across the variables v1 to v4 (e.g. participant 1 is missing 2 of 8 responses)?
Missing values are always reported per variable, not per participant, so how do I do this? Maybe there is a way using the AGGREGATE command with ID as BREAK?
Use COUNT to count the missing values into a new variable, then AGGREGATE by ID, or SPLIT FILE by ID and run FREQUENCIES.
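A minimal sketch of that route (nmiss_case and nmiss_id are illustrative names, not from the question):
* Count missing responses across v1 to v4 within each case.
COUNT nmiss_case = v1 TO v4 (MISSING).
* Sum the per-case counts per participant and append the total to each case.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=ID
  /nmiss_id = SUM(nmiss_case).
EXECUTE.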
I have a dataset where each ID has visited a website and recorded their risk level, which is coded 0-3. They have then returned to the website at a future date and recorded their risk level again. I want to calculate the difference between each of an ID's later risk levels and their first recorded risk level.
For example my dataset looks like this:
ID Timestamp RiskLevel
1 20-Jan-21 2
1 04-Apr-21 2
2 05-Feb-21 1
2 12-Mar-21 2
2 07-May-21 3
3 09-Feb-21 2
3 14-Mar-21 1
3 18-Jun-21 0
And I would like it to look like this:
ID Timestamp RiskLevel DifFromFirstRiskLevel
1 20-Jan-21 2 .
1 04-Apr-21 2 0
2 05-Feb-21 1 .
2 12-Mar-21 2 1
2 07-May-21 3 2
3 09-Feb-21 2 .
3 14-Mar-21 1 -1
3 18-Jun-21 0 -2
What should I do?
One way to approach this is with the strategy in my answer here, but I will take a different approach:
sort cases by ID timestamp.
compute firstRisk = risklevel.
* Carry the first case's value forward within each ID.
if $casenum > 1 and ID = lag(ID) firstRisk = lag(firstRisk).
execute.
compute DifFromFirstRiskLevel = risklevel - firstRisk.
execute.
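This works because SPSS runs transformations case by case: lag(firstRisk) returns the previous case's already-updated value, so the first risk level within each ID cascades down through all of that ID's later cases before the subtraction is computed.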
I am trying to perform a random forest survival analysis following the randomForestSRC vignette in R. I have a data frame containing 59 variables, 14 of them numeric and the rest factors. Two of the numeric ones are TIME (days till death) and DIED (0/1, dead or not). I'm running into 2 problems:
library(randomForestSRC)
trainrfsrc <- rfsrc(Surv(TIME, DIED) ~ .,
                    data = train, nsplit = 10, na.action = "na.impute")
This works fine; printing trainrfsrc gives: Error rate: 17.07%. However, exploring the error rate, for example with:
library(ggRandomForests)
plot(gg_error(trainrfsrc)) + coord_cartesian(ylim = c(.09, .31))
returns:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
or:
a <- gg_error(trainrfsrc)
a
   error ntree
1     NA     1
2     NA     2
3     NA     3
4     NA     4
5     NA     5
6     NA     6
7     NA     7
8     NA     8
9     NA     9
10    NA    10
NA for all 1000 trees. How come there's no error rate for each number of trees tried?
The second problem arises when trying to explore the most important variables using VIMP, such as:
plot(gg_vimp(trainrfsrc)) + theme(legend.position = c(.8, .2)) + labs(fill = "VIMP > 0")
it returns:
In gg_vimp.rfsrc(trainrfsrc) : rfsrc object does not contain VIMP information. Calculating...
Any ideas? Thanks
Setting err.block=1 (or some integer between 1 and ntree) should fix the problem of the error coming back as NA. You can check the help file for rfsrc to read more about err.block.
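A minimal sketch of that fix applied to the call from the question (in newer randomForestSRC releases this argument appears to be named block.size; importance = TRUE is an extra assumption of mine to address the second problem by storing VIMP in the object):
library(randomForestSRC)
# err.block = 1 computes the cumulative OOB error after every tree,
# so gg_error() gets one value per ntree instead of NA.
# importance = TRUE stores VIMP, so gg_vimp() does not have to recalculate it.
trainrfsrc <- rfsrc(Surv(TIME, DIED) ~ .,
                    data = train, nsplit = 10,
                    na.action = "na.impute",
                    err.block = 1,
                    importance = TRUE)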
I have the following temperature values stored in a Prometheus DB (one sample per minute):
4
7
11
52
97
19
95
89
43
19
. . .
Now, I would like to get the average temperature in each 5-minute interval.
/api/v1/query_range?query=avg_over_time(current_temp[5m])&start=1475483802.739&end=1475498202.739&step=300&_=1475493021942
I get the following data back:
"values":[[1475488602.739,"4"],[1475488902.739,"37.2"],[1475489202.739,"51"],[1475489502.739,"79.6"] . . .
I really cannot relate these values (4, 37.2, 51, 79.6, ...) to averages of my data. Can someone help me with this?
Thanks
[Two example graphs from the Prometheus graphing tool]
Let me answer my own question. With the query I gave here:
/api/v1/query_range?query=avg_over_time(current_temp[5m])&start=1475483802.739&end=1475498202.739&step=300&_=1475493021942
the following happens:
Every 300 seconds (the step parameter), Prometheus takes the samples from the five minutes before that instant (each per-minute point you have) and calculates their average. It does this for every step in the timespan between 1475483802.739 and 1475498202.739.
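For instance, if the five per-minute samples in one such window happen to be 7, 11, 52, 97 and 19 (consecutive values from the list above), that window's average is (7 + 11 + 52 + 97 + 19) / 5 = 37.2, which matches the second value returned.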
More information here: https://github.com/prometheus/prometheus/issues/2051
I have just entered the space of data mining, machine learning and clustering. I have a particular problem and do not know which technique to use to solve it.
I want to perform clustering of observations (objects, or whatever) on a specific data format. All variables in each observation are numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represents a row (an observation, or a 1D vector, ...) and m represents a column (the variable index within each vector). n could be a very large number, and 0 < m < 100. A key point is that a row cannot contain duplicate values (in the 1st row, each value can appear only once).
So, I want to somehow perform clustering where I'll put observations into one cluster based on the number of identical values the rows/observations share.
If there are two rows like:
1
1 2 3 4 5
They should be clustered into the same cluster; if there is no match, then certainly not. Also, the number of rows in one cluster should not go above 100.
A sick problem..? If not, just for information: I haven't mentioned the time dimension. But let's skip that for now.
So, any directions from you guys?
Thanks and best regards,
JDK
It's hard to recommend anything since your problem is totally vague and we have no information on the data. Data mining (and in particular explorative techniques like clustering) is all about understanding the data, so we cannot provide the ultimate answer.
Two things for you to consider:
1. if the data indicates presence of species or traits, Jaccard similarity (and other set-based metrics) are worth a try; see the worked example after this list.
2. if absence is less informative, maybe you should be mining association rules, not clusters.
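For example, the two rows shown earlier, {1} and {1 2 3 4 5}, intersect in one value while their union has five, giving a Jaccard similarity of 1/5 = 0.2.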
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks getting the best useless result!
Can your problem be treated as a Bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster these data. For example, say we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 12 13 14
What is your expected clustering here? How many clusters should there be? Only two?
Before you give more information, going by your current description, I think you do not need a clustering algorithm but a connected-components structure. In a first round you process the dataset to build the connected-components information, and in a second round you check which connected component each row belongs to (a runnable sketch follows the worked example below). Taking the example above, the first round:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (all points are linked to the smallest point to
represent that they belong to the same cluster as the smallest point)
2 3 4 : 2 <- 4 (2 and 3 are already linked to 1, which is <= 2, so they do
not need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (in fact this change is not essential because we already have
1 <- 2 <- 4, but making it can speed up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure representing the connected components of the points. In the second round you can simply pick one point from each row (the smallest one is best) and trace its root in the forest. The rows that share the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of distinct points, and O(nm + nh) time, where h is the height of the forest structure (typically h << m).
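A minimal runnable sketch of the two rounds (Python chosen just for illustration; find and cluster_rows are made-up names, and the path compression inside find is an extra refinement that keeps the trace chains short):
def find(parent, x):
    # Walk to the root of x's tree, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_rows(rows):
    parent = {}
    # First round: link every value in a row under the row's smallest root.
    for row in rows:
        for v in row:
            parent.setdefault(v, v)
        for v in row[1:]:
            ra, rb = find(parent, row[0]), find(parent, v)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
    # Second round: label each row by the root of its first point.
    return [find(parent, row[0]) for row in rows]

rows = [[1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4],
        [3, 4, 6], [6, 7, 8], [9, 10], [9, 11], [10, 12, 13, 14]]
print(cluster_rows(rows))  # [1, 1, 1, 1, 1, 1, 9, 9, 9]
Note this sketch ignores the asker's cap of 100 rows per cluster, which would need a post-processing split.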
I am not sure if this is the result you want.
I would like to create the last column. Thank you in advance!
You could try something like this:
/*************************************/.
DATA LIST FREE /v1 v2 v3 v4 v5.
BEGIN DATA
1 2 99 4 5
99 2 3 99 5
1 99 3 4 5
1 2 99 99 5
1 99 99 99 5
99 2 99 99 99
END DATA.
DATASET NAME DS1.
/*************************************/.
/* Solution 1: Assumes v1 to v5 can hold any value from 1 to 5 */.
recode v1 to v5 (99,sysmis=sysmis) (else=copy).
do repeat v=v1 to v5.
if (any(v,1,4,5)) Target1=1.
if (any(v,2,3)) Target2=2.
end repeat.
compute TargetA=sum(Target1,Target2).
/* Solution 2: An alternative that assumes v1 holds values 1 only, v2 values 2 only, etc. */.
recode v1 to v5 (99,sysmis=sysmis) (else=1).
compute TargetB=sum(any(1,v1,v4,v5)*1, any(1,v2,v3)*2).
exe.
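In both solutions the result is 1 when only channels 1/4/5 are present, 2 when only channels 2/3 are present, and 3 (the sum of both targets) when channels from both groups are present.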
If I understand you correctly:
Your input file contains 5 columns, 1 per channel
Each channel-specific column is filled with a channel-specific identifier (1-5)
When a column is empty, that channel is not used / not relevant for that observation
You want to summarize the mix of channels used in a new field (NewVar)
You want to use the IF statement in the SPSS syntax
The answer above by JigneshSutar does not seem to do this. Also, you do not need the DO REPEAT loops but can do this in 3 lines of syntax (+ EXECUTE.), using the data generator in the answer by JigneshSutar:
IF (V1 = 1 & V4 = 4 & V5 = 5) NewVar = 1.
IF (V2 = 2 & V3 = 3) NewVar = 2.
IF (V1 = 1 & V2 = 2 & V3 = 3 & V4 = 4 & V5 = 5) NewVar = 3.
EXECUTE.
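Note that the order of the IF statements matters: a case in which all five channels are filled satisfies all three conditions, and because the statements execute top to bottom, such a case ends up with NewVar = 3.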
This syntax can easily be adjusted when the channel columns are filled with values other than the channel identifiers [1-5], for instance by using the MISSING function:
IF (MISSING(V1)=0 & MISSING(V4)=0 & MISSING(V5)=0) NewVar = 1.
IF (MISSING(V2)=0 & MISSING(V3)=0) NewVar = 2.
IF (MISSING(V1)=0 & MISSING(V2)=0 & MISSING(V3)=0 & MISSING(V4)=0 & MISSING(V5)=0) NewVar = 3.
EXECUTE.