How to get solution report from lp_select gem (lpsolve) - ruby-on-rails

Thank you for your time.
I couldn't find how to get the variable values after solving.
Make a three-row, five-column problem:
@lp = LPSolve::make_lp(3, 5)
Set some column names:
LPSolve::set_col_name(@lp, 1, "fred")
LPSolve::set_col_name(@lp, 2, "bob")
Add a constraint and a row name; the API expects a 1-indexed array:
constraint_vars = [0, 0, 1]
FFI::MemoryPointer.new(:double, constraint_vars.size) do |p|
p.write_array_of_double(constraint_vars)
LPSolve::add_constraint(@lp, p, LPSelect::EQ, 1.0)
end
LPSolve::set_row_name(@lp, 1, "onlyBob")
Set the objective function and minimize it:
constraint_vars = [0, 1.0, 3.0]
FFI::MemoryPointer.new(:double, constraint_vars.size) do |p|
p.write_array_of_double(constraint_vars)
LPSolve::set_obj_fn(@lp, p)
end
LPSolve::set_minim(@lp)
Solve it and retrieve the result:
LPSolve::solve(@lp)
@objective = LPSolve::get_objective(@lp)
Output
Model name: '' - run #1
Objective: Minimize(R0)
SUBMITTED
Model size: 4 constraints, 5 variables, 1 non-zeros.
Sets: 0 GUB, 0 SOS.
Using DUAL simplex for phase 1 and PRIMAL simplex for phase 2.
The primal and dual simplex pricing strategy set to 'Devex'.
Optimal solution 3 after 1 iter.
Excellent numeric accuracy ||*|| = 0
MEMO: lp_solve version 5.5.0.15 for 64 bit OS, with 64 bit REAL
variables.
In the total iteration count 1, 0 (0.0%) were bound flips.
There were 0 refactorizations, 0 triggered by time and 0 by density.
... on average 1.0 major pivots per refactorization.
The largest [LUSOL v2.2.1.0] fact(B) had 5 NZ entries, 1.0x largest basis.
The constraint matrix inf-norm is 1, with a dynamic range of 1.
Time to load data was 0.031 seconds, presolve used 0.000 seconds,
... 0.000 seconds in simplex solver, in total 0.031 seconds. => 3.0

retvals = []
FFI::MemoryPointer.new(:double, 2) do |p|
err = LPSolve::get_variables(@lp, p)
retvals = p.get_array_of_double(0,2)
end
retvals[0]
retvals[1]
give the solution values.

Related

Poisson Distribution

I have a Poisson distribution that looks similar to the one below:
https://i0.wp.com/www.real-statistics.com/wp-content/uploads/2012/11/poisson-distribution-chart.png
I've been asked to find the mean, and then the three logical groups above and below the mean for a total of seven groups.
Were this a normal distribution where the min was 0, max was 12 and mean was 6, the logical groups might be:
-3: 1
-2: 2.666
-1: 4.333
0: 6
1: 7.666
2: 9.333
3: 11
But with a Poisson distribution (such as the image above), I would expect it to be more like:
-3: 0.625
-2: 1.25
-1: 1.875
0: 2.5
1: 4.25
2: 6.5
3: 10
Is there a faster way of looking for where these points would be than eyeballing it? I need to do this with more than a hundred histograms...
I apologize if I have the language wrong; this is my first time doing something like this.
Imagine that you need 7 bins that store the values you need.
For a Poisson distribution, the mean is lambda itself, which in your case is 3, so bin[3] = 3.
Consider the formula:
bins = []
for n = min to groups + min:   # typically it is 0 to groups - 1
    bins[n] = min + range * n / groups
Now you need 2 different ranges:
n = 0 to 2, min = 0, max = 3, range = (3 - 0) = 3, groups = 3
n = 4 to 6, min = 3, max = 12, range = (12 - 3) = 9, groups = 3
You may apply these values in the above formula to get your bins.
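For illustration, here is a rough Python sketch of that two-range interpolation (the function name and the exact indexing are my own reading of the formula above, using the numbers from your example):
# Seven bins: the centre bin sits at the mean (lambda), with three
# linearly interpolated bins below it and three above it.
def seven_bins(lo, mean, hi, groups=3):
    lower = [lo + (mean - lo) * n / groups for n in range(groups)]            # 0, 1, 2
    upper = [mean + (hi - mean) * n / groups for n in range(1, groups + 1)]   # 6, 9, 12
    return lower + [mean] + upper

print(seven_bins(0, 3, 12))  # [0.0, 1.0, 2.0, 3, 6.0, 9.0, 12.0]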
HTH. My memory is a little out of practice, but I think the general idea is correct.
Edit: This might not work for a Poisson distribution. Poisson is a discrete distribution, while my solution works only for continuous distributions. I will leave my answer here anyway.

Data Science: Scoring methodology

I am looking for any methodology to assign a risk score to an individual based on certain events. I am looking to have a 0-100 scale with an exponential assignment. For example, for one event a day the score may rise to 25, for 2 it may rise to 50-60 and for 3-4 events a day the score for the day would be 100.
I tried to Google it, but since I am not aware of the right terminology, I keep ending up on random topics. :(
Is there any mathematical terminology for this kind of scoring system? What are the most common methods you might know?
P.S.: Expert/experience data scientist advice highly appreciated ;)
I would start by writing down some qualifications:
0 events trigger a score of 0.
Non-edge event counts are where the scores below the 100 threshold live.
Any event count past the threshold scores 100.
If so, here's a (very) simplified example:
Stage Data:
userid <- c("a1","a2","a3","a4","a11","a12","a13","a14","u2","wtf42","ub40","foo","bar","baz","blue","bop","bob","boop","beep","mee","r")
events <- c(0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,2,3,6,122,13,1)
df1 <- data.frame(userid,events)
Optional: Normalize events to be in (1,2].
This can be helpful because of the logarithmic properties involved (otherwise, given the assumed function score = events^exponent, as in this example, 1 event will always yield a score of 1). It lets you control sensitivity, but it must be done carefully since we are dealing with exponents and logarithms. I am not using normalization in the example:
normevents <- (events-mean(events))/((max(events)-min(events))*2)+1.5
Set the quantile threshold for max score:
MaxScoreThreshold <- 0.25
Get the non-edge quantiles of the events distribution:
qts <- quantile(events[events>min(events) & events<max(events)], c(seq(from=0, to=100,by=5)/100))
Find the event count that gives a score of 100, using the set threshold:
MaxScoreEvents <- quantile(qts,MaxScoreThreshold)
Find the exponent of your exponential function
Given that:
Score = events ^ exponent
events is a natural number (an integer > 0; we took care of this by omitting the edges)
exponent > 1
Exponent Calculation:
exponent <- log(100)/log(MaxScoreEvents)
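(As a rough sanity check: if MaxScoreEvents came out to be about 2, then exponent = log(100)/log(2) ≈ 6.64, so 2^6.64 ≈ 100, and any user with two or more events is capped at 100, which matches the result table below.)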
Generate the scores:
df1$Score <- apply(as.matrix(events^exponent), 1, FUN = function(x) {
  if (x > 100) {
    result <- 100
  } else if (x < 0) {
    result <- 0
  } else {
    result <- x
  }
  return(ceiling(result))
})
df1
Resulting Data Frame:
userid events Score
1 a1 0 0
2 a2 0 0
3 a3 0 0
4 a4 0 0
5 a11 0 0
6 a12 0 0
7 a13 0 0
8 a14 0 0
9 u2 0 0
10 wtf42 0 0
11 ub40 0 0
12 foo 0 0
13 bar 1 1
14 baz 2 100
15 blue 3 100
16 bop 2 100
17 bob 3 100
18 boop 6 100
19 beep 122 100
20 mee 13 100
21 r 1 1
Under the assumption that your data is larger and has more event categories, the score won't snap to 100 so quickly; it is also a function of the threshold.
I would rely more on the data to define the parameters (the threshold, in this case).
If you have prior data on whether users really did whatever it is your score assesses, you can perform supervised learning and set the threshold wherever that ratio exceeds 50%, for example. Or, if the graph of events versus probability of 'success' looks like the cumulative distribution function of a normal distribution, I'd set the threshold where it first hits 45 degrees.
You could also use logistic regression if you have prior data, but instead of feeding the regression output through the logit function, use that number directly as your score. You can normalize it to be within 0-100.
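As a rough sketch of that last idea in Python (the example above uses R; the data, labels, and the min-max rescaling here are purely hypothetical):
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical prior data: event counts plus a 0/1 label for whatever the score is meant to assess.
X = np.array([[0], [0], [1], [1], [2], [3], [3], [6], [13], [122]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
model = LogisticRegression().fit(X, y)

# Use the linear predictor (log-odds) directly as the raw score instead of
# pushing it through the logit, then rescale it to the 0-100 range.
raw = model.decision_function(X)
score = 100 * (raw - raw.min()) / (raw.max() - raw.min())
print(np.round(score).astype(int))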
It's not always easy to write a data science question. I made many assumptions about what you are looking for; I hope this is the general direction.

How to predict users' preferences using item similarity?

I am wondering whether I can predict if a user will like an item or not, given the similarities between items and the user's ratings on items.
I know the equation used in item-based collaborative filtering: the predicted rating is determined by the item's overall rating and the similarities between items.
The equation is:
r_{u,i} = \bar{r}_i + \frac{\sum_j S_{i,j} (r_{u,j} - \bar{r}_j)}{\sum_j S_{i,j}}
My question is,
If I got the similarities using other approaches (e.g. content-based approach), can I still use this equation?
Besides, for each user I only have a list of the user's favourite items, not actual rating values.
In this case, the rating of user u for item j and the average rating of item j are missing. Are there any better ways or equations to solve this problem?
Another problem: I wrote some Python code to test the above equation. The code is
import numpy
from scipy import spatial

mat = numpy.array([[0, 5, 5, 5, 0], [5, 0, 5, 0, 5], [5, 0, 5, 5, 0], [5, 5, 0, 5, 0]])
print mat

# Predict user u's rating of item i from the other items' ratings and their
# cosine similarity to item i.
def prediction(u, i):
    target = mat[u, i]
    r = numpy.mean(mat[:, i])
    a = 0.0
    b = 0.0
    for j in range(5):
        if j != i:
            simi = 1 - spatial.distance.cosine(mat[:, i], mat[:, j])
            dert = mat[u, j] - numpy.mean(mat[:, j])
            a += simi * dert
            b += simi
    return r + a / b

for u in range(4):
    lst = []
    for i in range(5):
        lst.append(str(round(prediction(u, i), 2)))
    print " ".join(lst)
The result is:
[[0 5 5 5 0]
[5 0 5 0 5]
[5 0 5 5 0]
[5 5 0 5 0]]
4.6 2.5 3.16 3.92 0.0
3.52 1.25 3.52 3.58 2.5
3.72 3.75 3.72 3.58 2.5
3.16 2.5 4.6 3.92 0.0
The first matrix is the input and the second one contains the predicted values. They don't look close; is anything wrong here?
Yes, you can use different similarity functions. For instance, cosine similarity over ratings is common but not the only option. In particular, similarity using content-based filtering can help with a sparse rating dataset (if you have relatively dense content metadata for items) because you're mapping users' preferences to the smaller content space rather than the larger individual item space.
If you only have a list of items that users have consumed (but not the magnitude of their preferences for each item), another algorithm is probably better. Try market basket analysis, such as association rule mining.
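For instance, here is a very small sketch of the counting behind association rules on implicit data (the user/item data and the rule output format are made up for illustration):
from itertools import combinations
from collections import Counter

# Hypothetical implicit data: each user maps to the set of items they consumed.
likes = {"u1": {"a", "b", "c"}, "u2": {"a", "b"}, "u3": {"b", "c"}, "u4": {"a", "c"}}

n_users = len(likes)
item_count = Counter()
pair_count = Counter()
for items in likes.values():
    item_count.update(items)
    pair_count.update(combinations(sorted(items), 2))

# Support and confidence for each candidate rule "x -> y" and "y -> x".
for (x, y), both in pair_count.items():
    support = both / n_users
    print("%s -> %s support=%.2f confidence=%.2f" % (x, y, support, both / item_count[x]))
    print("%s -> %s support=%.2f confidence=%.2f" % (y, x, support, both / item_count[y]))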
What you are referring to is a typical situation of implicit ratings (i.e. users do not give explicit ratings to items, let's say you just have likes and dislikes).
As for the approaches, you can use neighbourhood models or latent factor models.
I suggest you read this paper, which proposes a well-known machine-learning-based solution to the problem.
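As a small illustration of the neighbourhood idea on purely binary (liked / not liked) data, one simple option is to score an unseen item by summing its similarities to the items the user already liked. This sketch reuses the 0/5 matrix from the question, treated as 0/1 (in practice you would also mask out items the user already has):
import numpy as np

# Binary "liked" matrix (rows = users, columns = items).
liked = np.array([[0, 1, 1, 1, 0],
                  [1, 0, 1, 0, 1],
                  [1, 0, 1, 1, 0],
                  [1, 1, 0, 1, 0]], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(liked, axis=0)
sim = (liked.T @ liked) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

# Score of item i for user u = sum of similarities between i and the user's liked items.
scores = liked @ sim
print(np.round(scores, 2))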

Dividing data sets into testing and training data

I have a dataset with k examples and I want to partition into m sets.
How can I do this programmatically?
For example, if k = 5 and m = 2, then 5 / 2 = 2.5.
How do I partition it into 2 and 3, and not 2, 2 and 1?
Similarly, if k = 10 and m = 3, I want it to be partitioned into 3, 3 and 4, but not 3, 3, 3 and 1.
Usually, this sort of functionality is built into tools. But, assuming that your observations are independent, just set up a random number generator and do something like:
for i = 1 to k do;
    set r = rand();
    if r < 0.5 then data[i].which = 'set1'
    else data[i].which = 'set2'
You can extend this for any number of sets and probabilities.
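A quick Python sketch of that extension (the probabilities here are arbitrary, and with small k the realized sizes will only approximate them):
import random
from collections import Counter

def assign_sets(k, probs, seed=None):
    # Randomly assign each of k examples to one of len(probs) sets,
    # walking the cumulative probabilities just like the rand() loop above.
    rng = random.Random(seed)
    assignment = []
    for _ in range(k):
        r = rng.random()
        cumulative = 0.0
        chosen = len(probs) - 1          # fallback for floating-point edge cases
        for set_index, p in enumerate(probs):
            cumulative += p
            if r < cumulative:
                chosen = set_index
                break
        assignment.append(chosen)
    return assignment

split = assign_sets(10, [1.0 / 3, 1.0 / 3, 1.0 / 3], seed=0)
print(split)           # set label for each example
print(Counter(split))  # realized set sizes (only roughly equal)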
For an example where k = 5, you could actually get all the rows in a single set (with p = 0.5, all five rows land in, say, set1 about 0.5^5 ≈ 3% of the time). However, the point of splitting data is to deal with larger amounts of data. If you only have 5 or 10 rows, then splitting your observations into different partitions is probably not the way to go.

Difference-in-difference analysis in SPSS

I am trying to compare means of the two groups 'single mothers with one child' and 'single mothers with more than one child' before and after the reform of the EITC system in 1993.
Through the procedure T-test in SPSS, I can get the difference between groups before and after the reform. But how do I get the difference of the difference (I still want standard errors)?
I found these methods for STATA and R (http://thetarzan.wordpress.com/2011/06/20/differences-in-differences-estimation-in-r-and-stata/), but I can't seem to figure it out in SPSS.
Hope someone will be able to help.
All the best,
Anne
This can be done with the GENLIN procedure. Here's some random data I generated to show how:
data list list /after oneChild value.
begin data.
0 1 12
0 1 12
0 1 11
0 1 13
0 1 11
1 1 10
1 1 9
1 1 8
1 1 9
1 1 7
0 0 16
0 0 16
0 0 18
0 0 15
0 0 17
1 0 6
1 0 6
1 0 5
1 0 5
1 0 4
end data.
dataset name exampleData WINDOW=front.
EXECUTE.
value labels after 0 'before' 1 'after'.
value labels oneChild 0 '>1 child' 1 '1 child'.
The means for the groups (in order, before I truncated the values to integers) are 17, 6, 12, and 9, respectively. So our GENLIN procedure should produce estimates of -11 (the after-before difference in the >1 child group), -5 (the 1 child minus >1 child difference), and 8 (the difference between the two groups' after-before differences).
To graph the data, just so you can see what we're expecting:
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=after value oneChild MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: after=col(source(s), name("after"), unit.category())
DATA: value=col(source(s), name("value"))
DATA: oneChild=col(source(s), name("oneChild"), unit.category())
GUIDE: axis(dim(2), label("value"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label(""))
SCALE: linear(dim(2), include(0))
ELEMENT: line(position(smooth.linear(after*value)), color.interior(oneChild))
ELEMENT: point.dodge.symmetric(position(after*value), color.interior(oneChild))
END GPL.
Now, for the GENLIN:
* Generalized Linear Models.
GENLIN value BY after oneChild (ORDER=DESCENDING)
/MODEL after oneChild after*oneChild INTERCEPT=YES
DISTRIBUTION=NORMAL LINK=IDENTITY
/CRITERIA SCALE=MLE COVB=MODEL PCONVERGE=1E-006(ABSOLUTE) SINGULAR=1E-012 ANALYSISTYPE=3(WALD)
CILEVEL=95 CITYPE=WALD LIKELIHOOD=FULL
/MISSING CLASSMISSING=EXCLUDE
/PRINT CPS DESCRIPTIVES MODELINFO FIT SUMMARY SOLUTION.
The results table shows just what we expect.
The >1 child group is between 10.1 and 12.3 lower after vs. before; this 95% CI contains the "real" value of 11.
The before-period difference between the >1 child and 1 child groups is between 3.5 and 5.7, containing the real value of 5.
The difference-of-differences is between 6.4 and 9.6, containing the real value of (17 - 6) - (12 - 9) = 8.
Std. errors, p values, and the other hypothesis testing values are all reported as well. Hope that helps.
EDIT: this can be done with less "complicated" syntax by computing the interaction term yourself and doing simple linear regression:
compute interaction = after*onechild.
execute.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT value
/METHOD=ENTER after oneChild interaction.
Note that the resulting standard errors and confidence intervals are actually different from the previous method. I don't know enough about SPSS's GENLIN and REGRESSION procedures to tell you why that's the case. In this contrived example, the conclusion you'd draw from your data would be approximately the same. In real life, the data aren't likely to be this clean, so I don't know which method is "better".
I take the general linear model to be an ANOVA-type model, so use the related module in SPSS's Analyze menu.
After the t-test, you need to check the equality of variances across the groups.
Regarding the first answer above:
* Note that GENLIN uses maximum likelihood estimation (MLE) whereas REGRESSION
* uses ordinary least squares (OLS). Therefore, GENLIN reports z- and Chi-square tests
* where REGRESSION reports t- and F-tests. Rather than using GENLIN, use UNIANOVA
* to get the same results as REGRESSION, but without the need to compute your own
* product term.
UNIANOVA value BY after oneChild
/PLOT=PROFILE(after*oneChild)
/PLOT=PROFILE(oneChild*after)
/PRINT PARAMETER
/EMMEANS=TABLES(after*oneChild) COMPARE(after)
/EMMEANS=TABLES(after*oneChild) COMPARE(oneChild)
/DESIGN=after oneChild after*oneChild.
HTH.
