Poisson Distribution - histogram

I have a Poisson distribution that looks similar to the one below:
https://i0.wp.com/www.real-statistics.com/wp-content/uploads/2012/11/poisson-distribution-chart.png
I've been asked to find the mean, and then the three logical groups above and below the mean for a total of seven groups.
Were this a normal distribution where the min was 0, max was 12 and mean was 6, the logical groups might be:
-3: 1
-2: 2.666
-1: 4.333
0: 6
1: 7.666
2: 9.333
3: 11
But with a Poisson distribution (such as the image above), I would expect it to be more like:
-3: 0.625
-2: 1.25
-1: 1.875
0: 2.5
1: 4.25
2: 6.5
3: 10
Is there a faster way of looking for where these points would be than eyeballing it? I need to do this with more than a hundred histograms...
I apologize if I have the language wrong; this is my first time doing something like this.

Imagine that you need 7 bins to store the values you need.
For a Poisson distribution, the mean is lambda itself, which in your case is 3. So bin[3] = 3.
Consider the formula:
bins = []
for n = min to groups + min:   # typically it is 0 to groups - 1
    bins[n] = min + range * n / groups
Now you need 2 different ranges:
n = 0 to 2, min = 0, max = 3, range = (3 - 0) = 3, groups = 3
n = 4 to 6, min = 3, max = 12, range = (12 - 3) = 9, groups = 3
You can plug these values into the formula above to get your bins.
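If it helps, here is a minimal Python sketch of the two-range binning described above (assuming, as in the answer, a min of 0, a mean of 3 and a max of 12; note the second loop offsets its index so that range starts at the mean):

def seven_bins(mean, max_value, min_value=0, groups=3):
    # 7 bins: 3 below the mean, the mean itself, and 3 above it
    bins = [0.0] * (2 * groups + 1)
    low_range = mean - min_value       # e.g. 3 - 0 = 3
    high_range = max_value - mean      # e.g. 12 - 3 = 9
    for n in range(groups + 1):        # bins 0..3: min up to the mean
        bins[n] = min_value + low_range * n / float(groups)
    for n in range(1, groups + 1):     # bins 4..6: mean up to the max
        bins[groups + n] = mean + high_range * n / float(groups)
    return bins

print(seven_bins(3, 12))   # [0.0, 1.0, 2.0, 3.0, 6.0, 9.0, 12.0]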
HTH. My memory is a little out of practice, but I think the general idea is correct.
Edit: This might not work for a Poisson distribution. Poisson is a discrete distribution, while my solution works only for continuous distributions. I will leave my answer here anyway.

Related

How to get solution report from lp_select gem (lpsolve)

Thank you for your time.
I couldn't find how to get the variable values after solving.
Make a three-row, five-column problem
@lp = LPSolve::make_lp(3, 5)
Set some column names
LPSolve::set_col_name(@lp, 1, "fred")
LPSolve::set_col_name(@lp, 2, "bob")
Add a constraint and a row name, the API expects a 1 indexed array
constraint_vars = [0, 0, 1]
FFI::MemoryPointer.new(:double, constraint_vars.size) do |p|
p.write_array_of_double(constraint_vars)
LPSolve::add_constraint(@lp, p, LPSelect::EQ, 1.0.to_f)
end
LPSolve::set_row_name(@lp, 1, "onlyBob")
Set the objective function and minimize it
constraint_vars = [0, 1.0, 3.0]
FFI::MemoryPointer.new(:double, constraint_vars.size) do |p|
p.write_array_of_double(constraint_vars)
LPSolve::set_obj_fn(@lp, p)
end
LPSolve::set_minim(@lp)
Solve it and retrieve the result
LPSolve::solve(@lp)
@objective = LPSolve::get_objective(@lp)
Output
Model name: '' - run #1
Objective: Minimize(R0)
SUBMITTED
Model size: 4 constraints, 5 variables, 1 non-zeros.
Sets: 0 GUB, 0 SOS.
Using DUAL simplex for phase 1 and PRIMAL simplex for phase 2.
The primal and dual simplex pricing strategy set to 'Devex'.
Optimal solution 3 after 1 iter.
Excellent numeric accuracy ||*|| = 0
MEMO: lp_solve version 5.5.0.15 for 64 bit OS, with 64 bit REAL
variables.
In the total iteration count 1, 0 (0.0%) were bound flips.
There were 0 refactorizations, 0 triggered by time and 0 by density.
... on average 1.0 major pivots per refactorization.
The largest [LUSOL v2.2.1.0] fact(B) had 5 NZ entries, 1.0x largest basis.
The constraint matrix inf-norm is 1, with a dynamic range of 1.
Time to load data was 0.031 seconds, presolve used 0.000 seconds,
... 0.000 seconds in simplex solver, in total 0.031 seconds. => 3.0
retvals = []
FFI::MemoryPointer.new(:double, 2) do |p|
err = LPSolve::get_variables(@lp, p)
retvals = p.get_array_of_double(0,2)
end
retvals[0]
retvals[1]
give the solution values.

How to predict users' preferences using item similarity?

I am wondering whether I can predict if a user will like an item, given the similarities between items and the user's ratings on items.
I know the equation for item-based collaborative filtering recommendation; the predicted rating is determined by the item's average rating and the similarities between items.
The equation is:
r_{u,i} = \bar{r_i} + \frac{\sum_j S_{i,j} (r_{u,j} - \bar{r_j})}{\sum_j S_{i,j}}
My question is:
If I got the similarities using other approaches (e.g. a content-based approach), can I still use this equation?
Besides, for each user I only have a list of the user's favourite items, not the actual rating values.
In this case, the rating of user u for item j and the average rating of item j are missing. Are there any better ways or equations to solve this problem?
Another problem: I wrote some Python code to test the above equation. The code is
import numpy
from scipy import spatial

mat = numpy.array([[0, 5, 5, 5, 0], [5, 0, 5, 0, 5], [5, 0, 5, 5, 0], [5, 5, 0, 5, 0]])
print mat

def prediction(u, i):
    target = mat[u, i]
    r = numpy.mean(mat[:, i])
    a = 0.0
    b = 0.0
    for j in range(5):
        if j != i:
            simi = 1 - spatial.distance.cosine(mat[:, i], mat[:, j])
            dert = mat[u, j] - numpy.mean(mat[:, j])
            a += simi * dert
            b += simi
    return r + a / b

for u in range(4):
    lst = []
    for i in range(5):
        lst.append(str(round(prediction(u, i), 2)))
    print " ".join(lst)
The result is:
[[0 5 5 5 0]
[5 0 5 0 5]
[5 0 5 5 0]
[5 5 0 5 0]]
4.6 2.5 3.16 3.92 0.0
3.52 1.25 3.52 3.58 2.5
3.72 3.75 3.72 3.58 2.5
3.16 2.5 4.6 3.92 0.0
The first matrix is the input and the second is the predicted values; they don't look close. Is anything wrong here?
Yes, you can use different similarity functions. For instance, cosine similarity over ratings is common but not the only option. In particular, similarity using content-based filtering can help with a sparse rating dataset (if you have relatively dense content metadata for items) because you're mapping users' preferences to the smaller content space rather than the larger individual item space.
If you only have a list of items that users have consumed (but not the magnitude of their preferences for each item), another algorithm is probably better. Try market basket analysis, such as association rule mining.
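As a rough illustration of the association-rule idea (just a sketch over made-up like lists, not a full Apriori implementation), you can score a rule "liked A, therefore likes B" by its confidence:

# Hypothetical data: each user's set of liked items.
likes = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "c"},
    "u3": {"b", "d"},
}

def confidence(a, b):
    # confidence of the rule "liked a -> likes b"
    users_with_a = [items for items in likes.values() if a in items]
    if not users_with_a:
        return 0.0
    return sum(1 for items in users_with_a if b in items) / float(len(users_with_a))

print(confidence("a", "c"))   # 1.0: every user who liked "a" also liked "c"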
What you are describing is a typical case of implicit ratings (i.e. users do not give explicit ratings to items; say you just have likes and dislikes).
As for approaches, you can use neighbourhood models or latent factor models.
I suggest reading this paper, which proposes a well-known machine-learning-based solution to the problem.
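As a very rough illustration of the neighbourhood idea on implicit data (a sketch only; this is not the method from the paper), you can score a candidate item by summing its similarities to the items the user already liked:

import numpy
from scipy import spatial

# Binary like matrix: rows = users, columns = items (1 = liked).
likes = numpy.array([[1, 0, 1, 1, 0],
                     [1, 1, 0, 1, 0],
                     [0, 1, 1, 0, 1]])

def score(u, i):
    # sum of cosine similarities between item i and the items user u liked
    liked = [j for j in range(likes.shape[1]) if j != i and likes[u, j] == 1]
    return sum(1 - spatial.distance.cosine(likes[:, i], likes[:, j]) for j in liked)

print(score(0, 1))   # how strongly item 1 relates to what user 0 already liked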

vowpalwabbit strange features count

I have found that during training, vw reports a very large total feature count in its log (much larger than my actual feature count).
I have tried to reproduce it using a small example:
simple.test:
-1 | 1 2 3
1 | 3 4 5
then "vw simple.test" command says that it have used 8 features. +one feature is constant but what are the other ? And in my real exmaple difference between my features and features used in wv is abot x10 more.
....
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = t
num sources = 1
average since example example current current current
loss last counter weight label predict features
finished run
number of examples = 2
weighted example sum = 2
weighted label sum = 3
average loss = 1.9179
best constant = 1.5
total feature number = 8 !!!!
total feature number displays the sum of feature counts over all observed examples. So it's 2*(3 + 1 constant) = 8 in your case. The number of features in the current example is displayed in the current features column. Note that only every 2^Nth example is printed to the screen by default. In general, observations can have unequal numbers of features.
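If you want to double-check that arithmetic, here is a rough Python sketch that counts features the same way for a simple file like simple.test above (assuming a single unnamed namespace and no explicit feature values):

total = 0
with open("simple.test") as f:
    for line in f:
        # everything after the first '|' is the feature list
        features = line.split("|", 1)[1].split()
        total += len(features) + 1   # +1 for the implicit constant feature
print(total)   # 8 for the two-example file above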

Dividing data sets into testing and training data

I have a dataset with k examples and I want to partition it into m sets.
How can I do this programmatically?
For example, if k = 5 and m = 2, then 5 / 2 = 2.5.
How do I partition it into 2 and 3, and not 2, 2 and 1?
Similarly, if k = 10 and m = 3, I want it to be partitioned into 3, 3 and 4, but not 3, 3, 3 and 1.
Usually, this sort of functionality is built into tools. But, assuming that your observations are independent, just set up a random number generator and do something like:
for i = 1 to k do;
    set r = rand();
    if r < 0.5 then data[i].which = 'set1'
    else data[i].which = 'set2'
You can extend this for any number of sets and probabilities.
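For instance, here is a minimal Python sketch of that extension (function and variable names are just illustrative):

import random

def random_split(data, m, seed=None):
    # assign each observation to one of m sets, uniformly at random
    rng = random.Random(seed)
    sets = [[] for _ in range(m)]
    for row in data:
        sets[rng.randrange(m)].append(row)
    return sets

parts = random_split(range(10), 3)
print([len(p) for p in parts])   # sizes vary around 10/3 from run to run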
For an example where k = 5, you could actually get all the rows in a single set (I'd guess about 3% of the time). However, the point of splitting data is to deal with larger amounts of data. If you only have 5 or 10 rows, then splitting your observations into different partitions is probably not the way to go.

How to process % to negative number in Visual Foxpro

How does the % operator work with negative numbers in Visual FoxPro?
MOD(10, -3) = -2
MOD(-10, 3) = 2
MOD(-10, -3) = -1
Why?
It is a regular modulo:
The mod function is defined as the amount by which a number exceeds
the largest integer multiple of the divisor that is not greater than
that number.
You can think of it like this:
10 % -3:
The result takes the sign of the divisor, so we need a multiple of -3 that leaves a non-positive remainder: 10 = (-3) * (-4) + (-2).
So 10 % -3 is -2.
-10 % 3:
Now, why is -10 % 3 equal to 2?
The easiest way to think about it is to add a multiple of 3 to the negative number so that it becomes non-negative.
-10 + (4 * 3) = 2, so -10 % 3 = (-10 + 12) % 3 = 2 % 3 = 2.
Here's what we said about this in The Hacker's Guide to Visual FoxPro:
MOD() and % are pretty straightforward when dealing with positive numbers, but they get interesting when one or both of the numbers is negative. The key to understanding the results is the following equation:
MOD(x,y) = x - (y * FLOOR(x/y))
Since the mathematical modulo operation isn't defined for negative numbers, it's a pleasure to see that the FoxPro definitions are mathematically consistent. However, they may be different from what you'd initially expect, so you may want to check for negative divisors or dividends.
A little testing (and the manuals) tells us that a positive divisor gives a positive result while a negative divisor gives a negative result.
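As a quick check of that formula, here is a small Python sketch that mirrors MOD() using math.floor (for verification only):

import math

def vfp_mod(x, y):
    # MOD(x, y) = x - (y * FLOOR(x / y)), as quoted above
    return x - y * int(math.floor(x / float(y)))

print(vfp_mod(10, -3))    # -2
print(vfp_mod(-10, 3))    #  2
print(vfp_mod(-10, -3))   # -1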
