Automatically learning clusters - machine-learning

Hi, complete newbie question here. I have a table with two columns. The first column is a "bin" ID, coded by where the fruit flies live. The second column is either 0 or 1: 0 means the fly is neutral toward sugar, 1 means it really likes sugar. I have two questions:
1) I suspect that a single variable, something about where the flies live, determines how much they like sugar. Is there a way to have the computer group the bins into just 2 clusters: the bins that like sugar vs. the neutral ones? That way we can run further experiments to determine what it is about those bins.
2) Is there a way to automatically determine how many clusters might be driving this behavior? For example, maybe there are 4 variables (4 clusters) that determine sugar preference.
Apologies if this is trivial. The table is listed below. Thanks!
Bin sugar
1 1
1 1
1 0
1 0
2 1
2 0
2 0
3 1
3 0
3 1
3 1
4 1
4 1
4 1
5 1
5 0
5 1
6 0
6 0
6 0
7 0
7 1
7 1
8 1
8 0
8 1
9 1
9 0
9 0
9 0
10 0
10 0
10 0
11 1
11 1
11 1
12 0
12 0
12 0
12 0
13 0
13 0
13 1
13 0
13 0
14 0
14 0
14 0
14 0
15 1
15 0
15 0
16 1
16 1
17 1
17 1
18 0
18 1
18 1
17 1
19 1
20 1
20 0
20 0
20 1
21 0
21 0
21 1
21 0
22 1
22 0
22 1
22 1
23 1
23 1
24 1
24 0
25 0
25 1
25 0
26 1
26 1
27 1
27 1

Okay, assuming I understood what you meant, one approach to problem 1) is Bayesian filtering, i.e. applying Bayes' theorem.
Say event L is "a fly likes sugar" and event B is "a fly is in bin B".
So what you have is:
number of flies = 84
size of each bin (e.g. size of bin 1 = 4)
probability that a fly likes sugar:
P(L) = flies that like sugar / total number of flies = 43/84
probability that a fly doesn't like sugar:
P(notL) = 1 - P(L) = 41/84
probability that a fly is in a given bin:
P(B) = size of the bin / sum of the sizes of all bins = 4/84 (for bin 1)
probability that a fly isn't in a given bin:
P(notB) = 1 - P(B) = 80/84 (for bin 1)
probability that a fly likes sugar, given that it is in bin B:
P(L|B) = flies that like sugar in a bin / size of the bin
(eg for bin 1 is 2/4 = 1/2)
probability that a fly likes sugar, given that it is not in bin B:
P(L|notB) = (total flies that like sugar - flies that like sugar in the bin) / (total number of flies - size of the bin) = 41/80 (for bin 1)
You want to know the probability that a fly is in a given bin B, given that it likes sugar, which you can obtain with:
P(B|L) = (P(L|B) * P(B)) / (P(L|B) * P(B) + P(L|notB) * P(notB))
If you compute P(B|L) and P(B|notL) for each bin, then you know which of the bins have the highest probability of containing flies that like sugar. Then you can further study those bins.
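As a rough sketch of the calculation above (in Python, which is an assumption since the question names no language), the per-bin counts can be tallied from the table and plugged into the formula. The names COUNTS and posterior_bin_given_likes are made up for illustration.

```python
from fractions import Fraction

# bin -> (flies that like sugar, bin size), tallied by hand from the table in the question
COUNTS = {
    1: (2, 4), 2: (1, 3), 3: (3, 4), 4: (3, 3), 5: (2, 3), 6: (0, 3),
    7: (2, 3), 8: (2, 3), 9: (1, 4), 10: (0, 3), 11: (3, 3), 12: (0, 4),
    13: (1, 5), 14: (0, 4), 15: (1, 3), 16: (2, 2), 17: (3, 3), 18: (2, 3),
    19: (1, 1), 20: (2, 4), 21: (1, 4), 22: (3, 4), 23: (2, 2), 24: (1, 2),
    25: (1, 3), 26: (2, 2), 27: (2, 2),
}


def posterior_bin_given_likes(counts, b):
    """P(B|L) via Bayes' theorem, using exact fractions."""
    total = sum(size for _, size in counts.values())          # 84 flies
    total_likes = sum(likes for likes, _ in counts.values())  # 43 like sugar
    likes, size = counts[b]
    p_b = Fraction(size, total)
    p_not_b = 1 - p_b
    p_l_given_b = Fraction(likes, size)
    p_l_given_not_b = Fraction(total_likes - likes, total - size)
    evidence = p_l_given_b * p_b + p_l_given_not_b * p_not_b  # equals P(L)
    return p_l_given_b * p_b / evidence


# Bin 1: (1/2 * 4/84) / (43/84) = 2/43
print(posterior_bin_given_likes(COUNTS, 1))

# Bins ordered from highest to lowest P(B|L)
ranked = sorted(COUNTS, key=lambda b: posterior_bin_given_likes(COUNTS, b), reverse=True)
print(ranked[:5])
```

Note that P(B|L) grows with bin size as well as with the share of sugar-liking flies, so it may be worth looking at P(L|B) for each bin alongside it.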
Hope I was clear. My statistics is a bit rusty and I'm not sure I am doing everything correctly, so take it as a hint to point you in the right direction.
You can refer here to get more accurate reasoning and results.
As for problem 2)... I have to think about it a bit more.

Related

When using the default 'randomForest' algorithm for classification, why doesn't the number of terminal nodes match the number of cases?

According to https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, classification trees are fully grown, meaning node size = 1.
However, if trees are really grown to a maximum, then shouldn't each terminal node contain a single case (data point, species, etc)?
If I run:
library(randomForest)
data(iris) #150 cases
set.seed(352)
rf <- randomForest(Species ~ ., iris)
hist(treesize(rf),main ="number of nodes")
I can see that most "fully grown" trees only have about 10 terminal nodes, meaning node size can't be equal to 1... right?
For example, status -1 below marks a terminal node in the 134th tree in the forest. Only 8 terminal nodes!?
> getTree(rf,134)
   left daughter right daughter split var split point status prediction
1              2              3         3        2.50      1          0
2              0              0         0        0.00     -1          1
3              4              5         4        1.75      1          0
4              6              7         3        4.95      1          0
5              8              9         3        4.85      1          0
6             10             11         4        1.60      1          0
7             12             13         1        6.50      1          0
8             14             15         1        5.95      1          0
9              0              0         0        0.00     -1          3
10             0              0         0        0.00     -1          2
11             0              0         0        0.00     -1          3
12             0              0         0        0.00     -1          3
13             0              0         0        0.00     -1          2
14             0              0         0        0.00     -1          2
15             0              0         0        0.00     -1          3
I would be grateful if someone could explain.
"Fully grown" means "nothing left to split". A node of a decision tree is fully grown if all data records assigned to it belong to the same class and therefore get the same prediction.
In the iris dataset case, once you reach a node with 50 setosa data records in it, it doesn't make sense to split it into two child nodes with 25 and 25 setosas each.
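A toy sketch of that stopping rule, written in Python rather than R and not the randomForest implementation: a one-feature tree that keeps splitting on the best Gini threshold and stops as soon as a node is pure. The function names and the toy data are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def count_leaves(points):
    """points: list of (x, label). Split until every node is pure; return the leaf count."""
    labels = [label for _, label in points]
    if len(set(labels)) == 1:
        return 1  # pure node: nothing left to split, whatever its size
    xs = sorted({x for x, _ in points})
    if len(xs) == 1:
        return 1  # identical feature values: no threshold can separate them
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        score = (len(left) * gini([l for _, l in left])
                 + len(right) * gini([l for _, l in right])) / len(points)
        if best is None or score < best[0]:
            best = (score, left, right)
    _, left, right = best
    return count_leaves(left) + count_leaves(right)


# 100 cases, two well-separated classes: one split yields two pure leaves of 50 cases each
separable = [(1.0 + 0.01 * i, "setosa") for i in range(50)] + \
            [(4.0 + 0.01 * i, "versicolor") for i in range(50)]
print(count_leaves(separable))  # 2
```

The leaf count depends on how cleanly the classes separate, not on the number of cases, which matches the small trees seen for iris.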

Get a list of function results until result > x

I basically want the same thing as this OP:
Is there a J idiom for adding to a list until a certain condition is met?
But I can't get the answers to work with the OP's function or my own.
I will rephrase the question and write about the answers at the bottom.
I am trying to create a function that returns a list of the Fibonacci numbers less than 2,000,000 (without writing "while" inside the function).
Here is what I have tried:
First, I picked a way to calculate Fibonacci numbers from this site:
https://code.jsoftware.com/wiki/Essays/Fibonacci_Sequence
fib =: (i. +/ .! i.@-)"0
echo fib i.10
0 1 1 2 3 5 8 13 21 34
Then I made an arbitrary list that I knew was larger than what I needed:
fiblist =: (fib i.40) NB. THIS IS A BAD SOLUTION!
Finally, I removed the numbers that were greater than what I needed:
result =: (fiblist < 2e6) # fiblist
echo result
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1.34627e6
This gets the right result, but is there a way to avoid using some arbitrary number like 40 in "fib i.40"?
I would like to write a function such that "func 2e6" returns the list of Fibonacci numbers below 2,000,000 (without writing "while" inside the function).
echo func 2e6
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1.34627e6
here are the answers from the other question:
first answer:
2 *^:(100&>@:])^:_"0 (1 3 5 7 9 11)
128 192 160 112 144 176
second answer:
+:^:(100&>)^:(<_) ] 3
3 6 12 24 48 96 192
As I understand it, I just need to replace the functions used in the answers, but I don't see how that can work. For example, if I try:
echo (, [: +/ _2&{.)^:(100&>@:])^:_ i.2
I get an error.
I approached it this way. First I wanted a way of generating the nth Fibonacci number, and I used f0b from your link to the Jsoftware Essays.
f0b=: (-&2 +&$: -&1) ^: (1&<) M.
Once I had that, I put it into a verb that checks whether the result of f0b is less than a certain amount (I used 1000); if it is, the verb increments the input and goes through the process again. This is the ($:@:>:) part. $: is Self-Reference. The right argument 0 is the starting point for generating the sequence.
($:@:>: ^: (1000 > f0b)) 0
17
This tells me that 17 Fibonacci numbers are below my limit: the verb stops at index 17, the first index whose Fibonacci number (1597) is not under 1000. I use that information to generate the Fibonacci numbers by applying f0b to each item of i. ($:@:>: ^: (1000 > f0b)) 0, using rank 0 (f0b"0).
f0b"0 i. ($:@:>: ^: (1000 > f0b)) 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
In your case you wanted the ones under 2000000
f0b"0 i. ($:@:>: ^: (2000000 > f0b)) 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269
... and then I realized that you wanted a verb to be able to answer your original question. I went with dyadic where the left argument is the limit and the right argument generates the sequence. Same idea but I was able to make use of some hooks when I went to the tacit form. (> f0b) checks if the result of f0b is under the limit and ($: >:) increments the right argument while allowing the left argument to remain for $:
2000000 (($: >:) ^: (> f0b)) 0
32
fnum=: (($: >:) ^: (> f0b))
2000000 fnum 0
32
f0b"0 i. 2000000 fnum 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269
I have little doubt that others will come up with better solutions, but this is what I cobbled together tonight.
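For comparison, the same two-step idea (find the first index whose Fibonacci number is not under the limit, then map over the indices below it) can be sketched in Python. This is only an analogy to the J verbs above, not a translation of them; f0b and fnum reuse the names from the answer, and fibs_below is a made-up name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def f0b(n):
    """nth Fibonacci number, memoized (playing the role of M. in the J version)."""
    return n if n <= 1 else f0b(n - 2) + f0b(n - 1)


def fnum(limit, n=0):
    """First index n whose Fibonacci number is not under limit; recursion instead of a loop."""
    return fnum(limit, n + 1) if limit > f0b(n) else n


def fibs_below(limit):
    """All Fibonacci numbers strictly below limit."""
    return [f0b(i) for i in range(fnum(limit))]


print(fnum(1000))           # 17
print(fibs_below(1000))
print(fibs_below(2000000)[-1])  # 1346269
```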

ERROR while implementing Cox PH model for recurrent event survival analysis using counting process

I have been trying to run a Cox PH model on a sample data set of 10k customers (randomly drawn from a customer base of 32 million) to predict the probability of survival at time t (measured in months in my case). I am using recurrent event survival analysis with the counting process formulation for e-commerce. For this...
1. Observation starting point: right after a customer makes first purchase
2. Start/Stop times: Months of two consecutive purchases (as in the data)
I have a few independent variables as in the sample data below:
id start stop status tenure orders revenue Quantity
A      0   20      0      0      1   $89.0        1
B      0   17      0      0      1  $556.0        2
C      0   17      0      0      1  $900.0        2
D     32   33      0   1679      9  $357.8        9
D     26   32      1   1497      7  $326.8        7
D     23   26      1   1405      4  $142.9        4
D     17   23      1   1219      3   $63.9        3
D      9   17      1    978      2   $50.0        2
D      0    9      1    694      1   $35.0        1
E      0   15      0     28      2  $156.0        2
F      0   15      0      0      1  $348.0        1
F     12   14      0    375      2  $216.8        3
F      0   12      1      0      1   $67.8        2
G      9   15      0    277      2  $419.0        2
G      0    9      1      0      1  $359.0        1
While running cox PH using the following code:
fit10=coxph(Surv(start,stop,status)~orders+tenure+Quantity+revenue,data=test)
I keep getting the following error:
Warning: X matrix deemed to be singular; variable 1 2 3 4
I tried searching for the same error online but the answers I found said this could be because of interacting independent variables, whereas my variables are individual and continuous.
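Near-linear dependence between covariates is one commonly cited reason for a singular design matrix, even without explicit interaction terms. The sketch below (in Python rather than R, as an assumption for illustration) checks just one pair, orders and Quantity, on the 15 sample rows shown above. It makes no claim about the full 10k-row data or about what actually triggers the warning.

```python
from math import sqrt

# orders and Quantity columns copied from the 15 sample rows in the question
orders = [1, 1, 1, 9, 7, 4, 3, 2, 1, 2, 1, 2, 1, 2, 1]
quantity = [1, 2, 2, 9, 7, 4, 3, 2, 1, 2, 1, 3, 2, 2, 1]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# A value close to 1 means the two columns carry almost the same information
print(round(pearson(orders, quantity), 3))
```

On these sample rows the correlation is well above 0.9, so it may be worth checking the same thing on the full data, along with whether revenue is read as a number despite the "$" prefix.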

Why is the result of the mostSimilarItems method in Mahout not ordered by weight?

I have the following code:
ItemSimilarity itemSimilarity = new UncenteredCosineSimilarity(dataModel);
recommender = new GenericItemBasedRecommender(dataModel,itemSimilarity);
List<RecommendedItem> items = recommender.mostSimilarItems(10, 5);
My data model looks like this:
userid itemid score
1 6 5
1 10 3
1 11 5
1 12 4
1 13 5
2 2 3
2 6 5
2 10 3
2 12 5
When I run the code above, the result looks like this:
13
6
11
2
12
I debugged the code and found that all the items returned by recommender.mostSimilarItems(10, 5) have the same score, namely 1!
So I have a question. In my opinion, mostSimilarItems should take the item co-occurrence matrix into account:
    2  6 10 11 12 13
 2  0  1  1  0  1  0
 6  1  0  2  1  2  1
10  1  2  0  1  2  1
11  0  1  1  0  1  1
12  1  2  2  1  0  1
13  0  1  1  1  1  0
Given the matrix above, the items most similar to item 10 should be [6,12,11,13,2], because item 10 and item 12 are more similar than the other items, aren't they?
Can anyone explain this to me? Thanks!
In your matrix you have much more data than in your input. In particular you seem to be imputing 0 values that are not in the data. That is why you are likely getting answers different from what you expect.
Mahout expects your IDs to be contiguous Integers starting from 0. This is true of your row and column ids. Your matrix looks like it has missing ids. Just having Integers is not enough.
Could this be the problem? Not sure what Mahout would do with the input above.
I always keep a dictionary to map Mahout IDs to/from my own.
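To see how co-occurrence counts and cosine scores can disagree on the data in the question, here is a small Python sketch (Python is an assumption for illustration; the thread itself uses Java). It computes an uncentered cosine similarity using only the users who rated both items, which is an assumption about how the library behaves and is not confirmed in this thread. The names RATINGS, cooccurrence and uncentered_cosine are made up.

```python
from math import sqrt

# userid -> {itemid: score}, copied from the data model in the question
RATINGS = {
    1: {6: 5, 10: 3, 11: 5, 12: 4, 13: 5},
    2: {2: 3, 6: 5, 10: 3, 12: 5},
}


def cooccurrence(ratings, a, b):
    """Number of users who rated both item a and item b."""
    return sum(1 for prefs in ratings.values() if a in prefs and b in prefs)


def uncentered_cosine(ratings, a, b):
    """Cosine of the two items' score vectors, restricted to users who rated both."""
    pairs = [(p[a], p[b]) for p in ratings.values() if a in p and b in p]
    if not pairs:
        return float("nan")
    dot = sum(x * y for x, y in pairs)
    norm_a = sqrt(sum(x * x for x, _ in pairs))
    norm_b = sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)


for other in (2, 6, 11, 12, 13):
    print(other, cooccurrence(RATINGS, 10, other),
          round(uncentered_cosine(RATINGS, 10, other), 4))
```

Under that assumption, any pair of items with a single co-rating user has a cosine of exactly 1, and so does a pair whose score vectors are proportional, so a high co-occurrence count does not by itself imply a higher score.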

High error on training set when asked to predict on training set but average loss while training is low

I am training a model using Vowpal Wabbit and noticed something very strange. During training, the reported average loss is very low, around 0.06. However, when I ask the model to predict labels on the same training data, the average loss is high, around 0.66, and the model performs poorly even on the training data. My initial conjecture was that the model suffered from high bias, so I increased its complexity to 300 hidden nodes in the first layer, but the problem persists.
I would greatly appreciate pointers on what could be going on.
The tutorial slides for VW mention:
"If you test on the train set, does it work? (no => something crazy)"
So something very crazy seems to be happening, and I am trying to understand where I should dig deeper.
More details:
I am using vowpal wabbit for a named entity recognition task where features are word representations. I am trying several models using neural networks with multiple hidden units and trying to evaluate the model. However all of my trained models exhibit high average loss when tested on the training data itself which I find very odd.
Here is one way to reproduce the problem:
Output of training:
vw -d ~/embeddings/eng_train_4.vw --loss_function logistic --oaa 6 --nn 32 -l 10 --random_weights 1 -f test_3.model --passes 4 -c
final_regressor = test_3.model
Num weight bits = 18
learning rate = 10
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = /home/vvkulkarni/embeddings/eng_train_4.vw.cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.666667 0.666667 3 3.0 1 1 577
0.833333 1.000000 6 6.0 1 2 577
0.818182 0.800000 11 11.0 4 4 577
0.863636 0.909091 22 22.0 1 4 577
0.636364 0.409091 44 44.0 1 1 577
0.390805 0.139535 87 87.0 1 1 577
0.258621 0.126437 174 174.0 1 1 577
0.160920 0.063218 348 348.0 1 1 577
0.145115 0.129310 696 696.0 1 1 577
0.138649 0.132184 1392 1392.0 1 1 577
0.122486 0.106322 2784 2784.0 1 1 577
0.097522 0.072557 5568 5568.0 1 1 577
0.076875 0.056224 11135 11135.0 1 1 577
0.058647 0.040417 22269 22269.0 1 1 577
0.047803 0.036959 44537 44537.0 1 1 577
0.038934 0.030066 89073 89073.0 1 1 577
0.036768 0.034601 178146 178146.0 1 1 577
0.032410 0.032410 356291 356291.0 1 1 577 h
0.029782 0.027155 712582 712582.0 1 1 577 h
finished run
number of examples per pass = 183259
passes used = 4
weighted example sum = 733036
weighted label sum = 0
average loss = 0.0276999
best constant = 0
total feature number = 422961744
Now when I evaluate the model above using the same data (used for training)
vw -t ~/embeddings/eng_train_4.vw -i test_3.model -p test_3.pred
only testing
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = test_3.pred
using no cache
Reading datafile = /home/vvkulkarni/embeddings/eng_train_4.vw
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.333333 0.333333 3 3.0 1 1 577
0.500000 0.666667 6 6.0 1 4 577
0.636364 0.800000 11 11.0 6 3 577
0.590909 0.545455 22 22.0 1 1 577
0.500000 0.409091 44 44.0 4 1 577
0.482759 0.465116 87 87.0 1 1 577
0.528736 0.574713 174 174.0 1 3 577
0.500000 0.471264 348 348.0 1 3 577
0.517241 0.534483 696 696.0 6 1 577
0.536638 0.556034 1392 1392.0 4 4 577
0.560345 0.584052 2784 2784.0 1 5 577
0.560884 0.561422 5568 5568.0 6 2 577
0.586349 0.611820 11135 11135.0 1 1 577
0.560914 0.535477 22269 22269.0 1 1 577
0.557200 0.553485 44537 44537.0 1 1 577
0.568938 0.580676 89073 89073.0 1 2 577
0.560568 0.552199 178146 178146.0 1 1 577
finished run
number of examples per pass = 203621
passes used = 1
weighted example sum = 203621
weighted label sum = 0
average loss = 0.557428 <<< This is what is tricky.
best constant = -4.91111e-06
total feature number = 117489309
Things I have tried:
1. I tried increasing the number of hidden nodes to 600, but to no avail.
2. I also tried using quadratic features with 300 hidden nodes, but that did not help either.
The rationale behind trying 1) and 2) was to increase model complexity, assuming that the high training error was due to high bias.
Update:
Even more interestingly, if I specify the number of passes to be 4 in the testing phase (even though I assumed the model would already have learnt a decision boundary), the problem goes away. I am trying to understand why.
vvkulkarni@einstein:/scratch1/vivek/test$ vw -t ~/embeddings/eng_train_4.vw -i test_3.model -p test_3_1.pred --passes 4 -c
only testing
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
predictions = test_3_1.pred
using cache_file = /home/vvkulkarni/embeddings/eng_train_4.vw.cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.333333 0.333333 3 3.0 1 1 577
0.166667 0.000000 6 6.0 1 1 577
0.090909 0.000000 11 11.0 4 4 577
0.045455 0.000000 22 22.0 1 1 577
0.022727 0.000000 44 44.0 1 1 577
0.011494 0.000000 87 87.0 1 1 577
0.017241 0.022989 174 174.0 1 1 577
0.022989 0.028736 348 348.0 1 1 577
0.020115 0.017241 696 696.0 1 1 577
0.043822 0.067529 1392 1392.0 1 1 577
0.031968 0.020115 2784 2784.0 1 1 577
0.031968 0.031968 5568 5568.0 1 1 577
0.032959 0.033950 11135 11135.0 1 1 577
0.029952 0.026944 22269 22269.0 1 1 577
0.029212 0.028471 44537 44537.0 1 1 577
0.030481 0.031750 89073 89073.0 1 1 577
0.028673 0.026866 178146 178146.0 1 1 577
0.034001 0.034001 356291 356291.0 1 1 577 h
0.034026 0.034051 712582 712582.0 1 1 577 h
You have hash collisions because you have many more features than you have spaces in the hash.
The default hash size is 18 bits, or 262144 slots. According to your first printout, there are 422961744 features, which requires at least 29 bits (2^28 = 268435456 is still too small), so you should add -b 29 (or more) to your command line.
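The arithmetic behind choosing a value for -b can be sketched in Python as follows. It treats the printed total as if it were the number of distinct features, as the paragraph above does. VW's total counts every feature occurrence across all examples, so it is only an upper bound on the distinct count. The helper name bits_needed is made up for illustration.

```python
def bits_needed(n_slots):
    """Smallest b such that 2**b >= n_slots (for n_slots >= 1)."""
    return (n_slots - 1).bit_length()


print(2 ** 18)                 # 262144 slots with the default 18 bits
print(bits_needed(262144))     # 18
print(bits_needed(422961744))  # 29
```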
I don't have your input file, so I cannot try it myself, but here is one way to check for collisions:
Run your learning phase and add --invert_hash final
then check collisions with these lines:
tail -n +13 final | sort -n -k 2 -t ':' | wc -l
tail -n +13 final | sort -nu -k 2 -t ':' | wc -l
The values output should be the same. I got this tip from John Langford, creator of Vowpal Wabbit.

Resources