Error while fitting a Cox PH model for recurrent-event survival analysis using the counting process formulation

I have been trying to fit a Cox PH model on a sample of 10k customers (drawn at random from a base of 32 million customers) to predict the probability of survival at time t (measured in months in my case). I am using recurrent-event survival analysis with the counting process formulation on e-commerce data. For this:
1. Observation starting point: right after a customer makes first purchase
2. Start/Stop times: Months of two consecutive purchases (as in the data)
I have a few independent variables as in the sample data below:
id  start  stop  status  tenure  orders  revenue  Quantity
A       0    20       0       0       1    $89.0         1
B       0    17       0       0       1   $556.0         2
C       0    17       0       0       1   $900.0         2
D      32    33       0    1679       9   $357.8         9
D      26    32       1    1497       7   $326.8         7
D      23    26       1    1405       4   $142.9         4
D      17    23       1    1219       3    $63.9         3
D       9    17       1     978       2    $50.0         2
D       0     9       1     694       1    $35.0         1
E       0    15       0      28       2   $156.0         2
F       0    15       0       0       1   $348.0         1
F      12    14       0     375       2   $216.8         3
F       0    12       1       0       1    $67.8         2
G       9    15       0     277       2   $419.0         2
G       0     9       1       0       1   $359.0         1
When I fit the Cox PH model with the following code:
fit10=coxph(Surv(start,stop,status)~orders+tenure+Quantity+revenue,data=test)
I keep getting the following warning:
Warning: X matrix deemed to be singular; variable 1 2 3 4
I tried searching for the same error online but the answers I found said this could be because of interacting independent variables, whereas my variables are individual and continuous.


When using the default 'randomForest' algorithm for classification, why doesn't the number of terminal nodes match the number of cases?

According to https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, classification trees are fully grown, meaning node size = 1.
However, if trees are really grown to a maximum, then shouldn't each terminal node contain a single case (data point, species, etc)?
If I run:
library(randomForest)
data(iris) #150 cases
set.seed(352)
rf <- randomForest(Species ~ ., iris)
hist(treesize(rf),main ="number of nodes")
I can see that most "fully grown" trees only have about 10 nodes, meaning node size can't be equal to 1... right?
For example, a status of -1 below marks a terminal node in the 134th tree of the forest. Only 8 terminal nodes!?
> getTree(rf,134)
   left daughter right daughter split var split point status prediction
1              2              3         3        2.50      1          0
2              0              0         0        0.00     -1          1
3              4              5         4        1.75      1          0
4              6              7         3        4.95      1          0
5              8              9         3        4.85      1          0
6             10             11         4        1.60      1          0
7             12             13         1        6.50      1          0
8             14             15         1        5.95      1          0
9              0              0         0        0.00     -1          3
10             0              0         0        0.00     -1          2
11             0              0         0        0.00     -1          3
12             0              0         0        0.00     -1          3
13             0              0         0        0.00     -1          2
14             0              0         0        0.00     -1          2
15             0              0         0        0.00     -1          3
I would be grateful if someone could explain.
"Fully grown" means "nothing left to split". A decision tree (or a node of one) is fully grown once all the data records assigned to a node belong to the same class, so they all yield the same prediction.
In the iris dataset case, once you reach a node with 50 setosa data records in it, it doesn't make sense to split it into two child nodes with 25 and 25 setosas each.
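To make the stopping rule concrete, here is a toy sketch in Python. It is not the randomForest implementation (no bootstrap sampling, no random feature subsets), just a hypothetical one-feature tree grower, `grow`, that splits on the lowest weighted Gini impurity and stops as soon as a node is pure. The data is made up: three classes of 50 points each that are perfectly separable on the single feature.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(points):
    """Grow a tree on (x, label) pairs and return the sizes of its leaves."""
    labels = [lab for _, lab in points]
    if len(set(labels)) == 1:
        return [len(points)]          # pure node: nothing left to split
    xs = sorted({x for x, _ in points})
    if len(xs) == 1:
        return [len(points)]          # identical x values cannot be separated
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2             # candidate threshold between neighbours
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        score = (len(left) * gini([l for _, l in left])
                 + len(right) * gini([l for _, l in right])) / len(points)
        if best is None or score < best[0]:
            best = (score, left, right)
    _, left, right = best
    return grow(left) + grow(right)

# Made-up data: 150 cases, three classes, separable on one feature.
data = ([(x, "a") for x in range(50)]
        + [(x, "b") for x in range(50, 100)]
        + [(x, "c") for x in range(100, 150)])
leaves = grow(data)
print(leaves)  # [50, 50, 50]
```

The tree is "fully grown" with only three terminal nodes of 50 cases each, because every leaf is already pure. A minimum node size of 1 only says how small a leaf is allowed to get, not how small it must be.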

Get a list of function results until result > x

I basically want the same thing as this OP:
Is there a J idiom for adding to a list until a certain condition is met?
But I can't get the answers to work with the OP's function or my own.
I will rephrase the question and write about the answers at the bottom.
I am trying to create a function that returns a list of the Fibonacci numbers below 2,000,000 (without writing "while" inside the function).
Here is what I have tried:
First, I picked a way to calculate Fibonacci numbers from this site:
https://code.jsoftware.com/wiki/Essays/Fibonacci_Sequence
fib =: (i. +/ .! i.@-)"0
echo fib i.10
0 1 1 2 3 5 8 13 21 34
Then I made an arbitrary list that I knew was larger than what I needed:
fiblist =: (fib i.40) NB. THIS IS A BAD SOLUTION!
Finally, I removed the numbers that were greater than what I needed:
result =: (fiblist < 2e6) # fiblist
echo result
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1.34627e6
This gets the right result, but is there a way to avoid using an arbitrary number like 40 in "fib i.40"?
I would like to write a function such that "func 2e6" returns the list of Fibonacci numbers below 2,000,000 (without writing "while" inside the function).
echo func 2e6
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1.34627e6
Here are the answers from the other question:
First answer:
2 *^:(100&>@:])^:_"0 (1 3 5 7 9 11)
128 192 160 112 144 176
Second answer:
+:^:(100&>)^:(<_) ] 3
3 6 12 24 48 96 192
As I understand it, I just need to replace the functions used in the answers, but I don't see how that can work. For example, if I try:
echo (, [: +/ _2&{.)^:(100&>@:])^:_ i.2
I get an error.
I approached it this way. First I want to have a way of generating the nth Fibonacci number, and I used f0b from your link to the Jsoftware Essays.
f0b=: (-&2 +&$: -&1) ^: (1&<) M.
Once I had that, I put it into a verb that checks whether the result of f0b is less than a certain amount (I used 1000); if it is, the verb increments its input and goes through the process again. This is the ($:@:>:) part. $: is Self-Reference. The right argument 0 is the starting point for generating the sequence.
($:@:>: ^: (1000 > f0b)) 0
17
This tells me that 17 is the first index whose Fibonacci number is not below my limit, so indices 0 through 16 give the numbers I want. I use that information to generate the Fibonacci numbers by applying f0b to each item of i. ($:@:>: ^: (1000 > f0b)) 0, using rank 0 (f0b"0):
f0b"0 i. ($:@:>: ^: (1000 > f0b)) 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
In your case you wanted the ones under 2000000:
f0b"0 i. ($:@:>: ^: (2000000 > f0b)) 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269
... and then I realized that you wanted a verb to answer your original question. I went with a dyadic verb where the left argument is the limit and the right argument is the starting point of the sequence. Same idea, but I was able to make use of some hooks when I went to the tacit form. (> f0b) checks whether the result of f0b is under the limit, and ($: >:) increments the right argument while leaving the left argument in place for $:
2000000 (($: >:) ^: (> f0b)) 0
32
fnum=: (($: >:) ^: (> f0b))
2000000 fnum 0
32
f0b"0 i. 2000000 fnum 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269
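For comparison outside J, the same "keep generating until the condition fails" idea can be sketched in Python with lazy iterators. This is only an illustration of the idea, not a translation of the verbs above; `fib_below` is a name I made up, and it assumes Python 3.8 or later for the `initial` argument of `itertools.accumulate`.

```python
from itertools import accumulate, repeat, takewhile

def fib_below(limit):
    """Return the Fibonacci numbers strictly below limit, without a while loop."""
    # Each state is a pair (F(n), F(n+1)); accumulate steps it lazily, forever.
    pairs = accumulate(repeat(None),
                       lambda p, _: (p[1], p[0] + p[1]),
                       initial=(0, 1))
    # takewhile stops consuming the endless stream at the first value >= limit.
    return list(takewhile(lambda v: v < limit, (a for a, _ in pairs)))

result = fib_below(2_000_000)
print(len(result), result[-1])  # 32 1346269
```

As with `2000000 fnum 0` above, 32 numbers come back and the last one is 1346269, but no arbitrary upper bound such as 40 is needed.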
I have little doubt that others will come up with better solutions, but this is what I cobbled together tonight.

Why is the result of the mostSimilarItems method in Mahout not ordered by weight?

I have the following code:
ItemSimilarity itemSimilarity = new UncenteredCosineSimilarity(dataModel);
recommender = new GenericItemBasedRecommender(dataModel,itemSimilarity);
List<RecommendedItem> items = recommender.mostSimilarItems(10, 5);
My data model looks like this:
userid itemid score
1 6 5
1 10 3
1 11 5
1 12 4
1 13 5
2 2 3
2 6 5
2 10 3
2 12 5
When I run the code above, the result looks like this:
13
6
11
2
12
I debugged the code and found that the list returned by recommender.mostSimilarItems(10, 5) contains items that all have the same score, namely 1!
So I have a question. In my opinion, mostSimilarItems should take the item co-occurrence matrix into account:
     2   6  10  11  12  13
 2   0   1   1   0   1   0
 6   1   0   2   1   2   1
10   1   2   0   1   2   1
11   0   1   1   0   1   1
12   1   2   2   1   0   1
13   0   1   1   1   1   0
According to the matrix above, the items most similar to item 10 should be [6, 12, 11, 13, 2], because items 6 and 12 co-occur with it more often than the other items do, shouldn't they?
Can anyone explain this to me? Thanks!
In your matrix you have much more data than in your input. In particular you seem to be imputing 0 values that are not in the data. That is why you are likely getting answers different from what you expect.
Mahout expects your IDs to be contiguous Integers starting from 0. This is true of your row and column ids. Your matrix looks like it has missing ids. Just having Integers is not enough.
Could this be the problem? Not sure what Mahout would do with the input above.
I always keep a dictionary to map Mahout IDs to/from my own.
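One way to check where the tied scores come from is to compute the uncentered cosine by hand on the data in the question. Below is a minimal sketch in Python (not Mahout code). It assumes the similarity between two items is taken only over the users who rated both of them; that is my reading of the behavior and is not confirmed anywhere above. The helper name `uncentered_cosine` is made up.

```python
from math import sqrt

# The data model from the question: user -> {item: score}.
ratings = {
    1: {6: 5, 10: 3, 11: 5, 12: 4, 13: 5},
    2: {2: 3, 6: 5, 10: 3, 12: 5},
}

def uncentered_cosine(item_a, item_b):
    """Cosine of the two items' score vectors, restricted to co-rating users."""
    pairs = [(r[item_a], r[item_b])
             for r in ratings.values()
             if item_a in r and item_b in r]
    if not pairs:
        return None  # no user rated both items
    dot = sum(a * b for a, b in pairs)
    norm_a = sqrt(sum(a * a for a, _ in pairs))
    norm_b = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_a * norm_b)

sims = {other: uncentered_cosine(10, other) for other in (2, 6, 11, 12, 13)}
for item, sim in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(item, round(sim, 4))
```

Under that assumption, items 2, 6, 11 and 13 all come out at a similarity of 1 with item 10 (any item sharing a single user is trivially at 1, and item 6's scores are proportional to item 10's), while item 12 lands slightly lower at about 0.994. That would be consistent with the reported order 13, 6, 11, 2, 12 if ties come back in arbitrary order: the measure looks at the angle between score vectors, not at how often two items co-occur.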

Automatically learning clusters

Hi, complete newbie question here: I have a table with two columns. The first column gives the "bin", coded by where the fruit flies live. The second column is either 0 or 1, meaning neutral vs. really likes sugar, respectively. I have two questions:
1) I suspect that there is a single variable, something about where they live, that determines how much they like sugar. Is there a way to have the computer group the bins into just 2 clusters: the bins that like sugar vs. the neutral ones? That way we can run further experiments to determine what it is about the bins.
2) Is there a way to automatically determine how many clusters might be driving this behavior? For example, maybe there are 4 variables (4 clusters) that determine the outcome of sugar preference.
Apologies if this is trivial. The table is listed below. Thanks!
Bin sugar
1 1
1 1
1 0
1 0
2 1
2 0
2 0
3 1
3 0
3 1
3 1
4 1
4 1
4 1
5 1
5 0
5 1
6 0
6 0
6 0
7 0
7 1
7 1
8 1
8 0
8 1
9 1
9 0
9 0
9 0
10 0
10 0
10 0
11 1
11 1
11 1
12 0
12 0
12 0
12 0
13 0
13 0
13 1
13 0
13 0
14 0
14 0
14 0
14 0
15 1
15 0
15 0
16 1
16 1
17 1
17 1
18 0
18 1
18 1
17 1
19 1
20 1
20 0
20 0
20 1
21 0
21 0
21 1
21 0
22 1
22 0
22 1
22 1
23 1
23 1
24 1
24 0
25 0
25 1
25 0
26 1
26 1
27 1
27 1
Okay, assuming I understood what you meant, one way to approach problem 1) is Bayesian filtering.
Say event L is "a fly likes sugar", event B is "a fly is in bin B".
So what you have is:
number of flies = 84
size of each bin = (e.g. size of bin 1: 4)
probability that a fly likes sugar:
P(L) = flies that like sugar / total number of flies = 43/84
probability that a fly doesn't like sugar:
P(notL) = 1 - P(L) = 41/84
probability that a fly is in a given bin:
P(B) = size of the bin / sum of the sizes of all bins = 4/84 (for bin 1)
probability that a fly isn't in a given bin:
P(notB) = 1 - P(B) = 80/84 (for bin 1)
probability that a fly likes sugar, knowing that it is in bin B:
P(L|B) = flies that like sugar in the bin / size of the bin
(e.g. for bin 1 it is 2/4 = 1/2)
probability that a fly likes sugar, knowing that it's not in bin B:
P(L|notB) = (total flies that like sugar - flies that like sugar in the bin) / (total number of flies - size of the bin) = 41/80 (for bin 1)
You want to know the probability that a fly is in a given bin B knowing that it likes sugar, which you can obtain with:
P(B|L) = (P(L|B) * P(B)) / (P(L|B) * P(B) + P(L|notB) * P(notB))
If you compute P(B|L) and P(B|notL) for each bin, then you know which of the bins have the highest probability of containing flies that like sugar. Then you can further study those bins.
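As a quick check of the formula, here is a small Python sketch that plugs in the numbers given above for bin 1 (4 flies in the bin, 2 of which like sugar, 84 flies in total, 43 of which like sugar). The function name `posterior_bin_given_likes` is made up for this illustration, and exact fractions are used so no rounding gets in the way.

```python
from fractions import Fraction

def posterior_bin_given_likes(bin_size, bin_likers, total, total_likers):
    """P(B|L) via Bayes' rule, using the quantities defined above."""
    p_b = Fraction(bin_size, total)                  # P(B)
    p_not_b = 1 - p_b                                # P(notB)
    p_l_given_b = Fraction(bin_likers, bin_size)     # P(L|B)
    p_l_given_not_b = Fraction(total_likers - bin_likers,
                               total - bin_size)     # P(L|notB)
    numerator = p_l_given_b * p_b
    return numerator / (numerator + p_l_given_not_b * p_not_b)

p_bin1 = posterior_bin_given_likes(bin_size=4, bin_likers=2,
                                   total=84, total_likers=43)
print(p_bin1)  # 2/43
```

The result 2/43 is simply "sugar-liking flies in bin 1 over all sugar-liking flies", which is a useful sanity check: the denominator of the formula collapses to P(L) = 43/84.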
I hope that was clear; my statistics is a bit rusty and I'm not even sure I am doing everything correctly. Take it as a hint to point you in the right direction to address the problem.
You can refer here to get more accurate reasoning and results.
As for problem 2)... I have to think about it a bit more.

Using COUNTIFS on 3 different columns and then need to SUM a 4th column?

I have written the formula below. I do not know how to change it so that it adds up the numbers I have in column AB (AB2:AB552). As it is, the formula counts the cells in that range that have numbers in them, but I need it to total those numbers as my final result. Any help would be great.
=COUNTIFS(Cases!B2:B552,"1",Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
Assuming you don't actually need the intermediate counts, the SUMIFS function should give you the final result:
=SUMIFS(Cases!AB2:AB552,Cases!B2:B552,1,Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
Testing this with some limited data:
Row  B  G  X    AB
2    2  a  No   10
3    1  c  No   24
4    2  c  No    4
5    1  c  No    0
6    1  a  Yes   9
7    2  c  No   12
8    2  c  No    6
9    2  b  No    0
10   1  b  No    0
11   1  a  No   10
12   2  c  No    6
13   1  c  No   20
14   1  c  No    4
15   1  b  Yes  22
16   1  b  Yes  22
The formula above returned 48, the sum of AB3, AB13, and AB14, which were the only rows matching all 4 criteria.
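The same filter-then-sum logic can be cross-checked outside Excel. Here is a minimal Python sketch over the test rows above; the helper `sumifs` is a made-up name that only imitates the four criteria (B equals 1, G matches the "c*" wildcard from the question, X equals "No", AB greater than 0). It is not a general reimplementation of Excel's SUMIFS.

```python
from fnmatch import fnmatchcase

# (B, G, X, AB) for worksheet rows 2..16 of the test data.
rows = [
    (2, "a", "No", 10), (1, "c", "No", 24), (2, "c", "No", 4),
    (1, "c", "No", 0),  (1, "a", "Yes", 9), (2, "c", "No", 12),
    (2, "c", "No", 6),  (2, "b", "No", 0),  (1, "b", "No", 0),
    (1, "a", "No", 10), (2, "c", "No", 6),  (1, "c", "No", 20),
    (1, "c", "No", 4),  (1, "b", "Yes", 22), (1, "b", "Yes", 22),
]

def sumifs(rows, b_value, g_pattern, x_value):
    """Sum AB over rows where every criterion holds (case-insensitive wildcard)."""
    return sum(
        ab
        for b, g, x, ab in rows
        if b == b_value
        and fnmatchcase(g.lower(), g_pattern.lower())
        and x == x_value
        and ab > 0
    )

total = sumifs(rows, 1, "c*", "No")
print(total)  # 48
```

COUNTIFS with the same criteria would instead return the number of matching rows (3 here), which is the behavior described in the question.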
