I'm developing a package in Julia that uses Clp together with JuMP to solve a simplex problem. Here is a sample of the code:
using JuMP, Clp

model = JuMP.Model(Clp.Optimizer)
@variable(model, x[1:size(c)[1]])
@constraint(model, A * x .== b)
@constraint(model, x .>= 0)
@objective(model, Min, c' * x)
optimize!(model)
The problem is, when using Clp, the code prints the iteration steps. Here is an example:
Coin0506I Presolve 500 (-62500) rows, 62500 (0) columns and 125000 (-62500) elements
Clp0006I 0 Obj 0 Primal inf 1.9995 (500)
Clp0006I 85 Obj 5.249611e-08 Primal inf 1.9070741 (461)
Clp0006I 170 Obj 1.3219003e-06 Primal inf 1.7932731 (424)
Clp0006I 255 Obj 2.1956446e-06 Primal inf 1.6079534 (387)
Clp0006I 338 Obj 4.6964461e-06 Primal inf 1.3793942 (354)
Clp0006I 423 Obj 5.8976838e-06 Primal inf 1.4504309 (331)
...
My question is: how can I suppress this output without resorting to another package such as Suppressor.jl?
Just set the LogLevel:
set_optimizer_attribute(model, "LogLevel", 0)
This will stop the logs from appearing.
There was a bug in Clp.jl v0.8.2 (issue #1883) which has been fixed; you just have to update to Clp.jl v0.8.3.
-- Maurice
I am working on a dataset with a feature that has multiple categories for a single example.
The feature looks like this:
Feature
0 [Category1, Category2, Category2, Category4, Category5]
1 [Category11, Category20, Category133]
2 [Category2, Category9]
3 [Category1000, Category1200, Category2000]
4 [Category12]
The problem is similar to this question: Encode categorical features with multiple categories per example - sklearn
Now, I want to vectorize this feature. One solution is to use MultiLabelBinarizer, as suggested in the answer to the similar question above. But there are around 2000 categories, which results in sparse, very high-dimensional encoded data.
Is there any other encoding that can be used, or any other possible solution for this problem? Thanks.
Given an incredibly sparse array, one could use a dimensionality reduction technique such as PCA (Principal Component Analysis) to reduce the feature space to the top k components that best describe the variance.
Assuming X is the MultiLabelBinarizer output with 2000 features:
from sklearn.decomposition import PCA

k = 5
model = PCA(n_components=k, random_state=666)
model.fit(X)
# PCA has no predict(); transform() projects X onto the top k components
components = model.transform(X)
And then you can use the top k components as a smaller-dimensional feature space that explains a large portion of the variance of the original feature space.
If you want to understand how well the new, smaller feature space describes the variance, you can check:
model.explained_variance_
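If you want a rule of thumb for choosing k, the ratio form of that attribute is handy. Below is a small sketch; the random 0/1 matrix is only a placeholder so the snippet runs on its own, and in practice X would be the MultiLabelBinarizer output.
# Sketch: look at how much variance the leading components capture.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(666)
X = rng.integers(0, 2, size=(200, 50))           # placeholder for the real encoded matrix

pca = PCA(n_components=10, random_state=666).fit(X)
print(pca.explained_variance_ratio_)             # per-component share of the variance
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative share, useful for picking k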
In many cases where I encountered the problem of too many features being generated from a column with many categories, I opted for binary encoding, and it worked out fine most of the time, so it is perhaps worth a shot for you.
Imagine you have 9 categories, numbered 1 to 9. Binary encode those numbers and you get:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
This is the basic intuition behind Binary Encoder.
PS: Given that 2^11 is 2048 and you may have 2000 categories or so, you can reduce your categories to just 11 feature columns instead of many (for example, 1999 columns in the case of one-hot encoding)!
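To make the mapping concrete, here is a minimal sketch of the idea in plain Python/numpy (the category names are hypothetical and each cell is assumed to hold a single category; the category_encoders package also ships a ready-made BinaryEncoder if you prefer not to roll your own):
# Sketch of binary encoding: give each category an integer id, then spell the
# id out in binary across ceil(log2(n_categories + 1)) columns.
import numpy as np

categories = [f"Category{i}" for i in range(1, 10)]      # 9 hypothetical categories
ids = {cat: i + 1 for i, cat in enumerate(categories)}   # ids 1..9
n_bits = int(np.ceil(np.log2(len(categories) + 1)))      # 4 bits cover ids up to 9

def binary_encode(cat):
    i = ids[cat]
    return [(i >> b) & 1 for b in reversed(range(n_bits))]

print(binary_encode("Category5"))   # [0, 1, 0, 1], matching "cat 5" in the table above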
I also encountered these same problems, but I solved them using CountVectorizer from sklearn.feature_extraction.text just by passing binary=True, i.e. CountVectorizer(binary=True).
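A short sketch of how that can look for a list-valued feature (hypothetical category names; each row is joined into a space-separated string so CountVectorizer can tokenize it, and binary=True makes the columns 0/1 instead of counts):
# Sketch: multi-hot encoding of list-valued categories via CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

rows = [
    ["Category1", "Category2", "Category4"],
    ["Category11", "Category20"],
    ["Category2", "Category9"],
]
docs = [" ".join(r) for r in rows]      # one "document" per example

vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)             # sparse matrix, one column per distinct category
print(vec.get_feature_names_out())
print(X.toarray())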
I ran the SVM-Light classifier, but the precision/recall row it outputs seems to be corrupted:
Reading model...OK. (20 support vectors read)
Classifying test examples..100..200..done
Runtime (without IO) in cpu-seconds: 0.00
Accuracy on test set: 95.50% (191 correct, 9 incorrect, 200 total)
Precision/recall on test set: 0.00%/0.00%
What should I configure to get valid precision and recall?
This is probably not corruption. For example, if your classifier always predicts "-1" (the negative class) while your test dataset contains 191 "-1" and 9 "+1" gold labels, you will get 191 examples classified correctly and 9 incorrectly.
True positives : 0 (TP)
True negatives : 191 (TN)
False negatives: 9 (FN)
False positives: 0 (FP)
Thus:
Precision = TP / (TP + FP) = 0 / (0 + 0) = undefined
Recall = TP / (TP + FN) = 0 / (0 + 9) = 0
From the formula above, you know that as long as your TP is zero, your precision/recall is either zero or undefined.
To debug, you should output (for each test example) the golden label and the predicted label so that you know where the issue is.
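As a quick illustration, the degenerate case above can be reconstructed in a few lines of Python (a sketch; it just rebuilds the label counts with scikit-learn rather than reading SVM-Light's files):
# Sketch: recompute accuracy/precision/recall for a classifier that always
# predicts -1 on a test set with 191 negatives and 9 positives.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [-1] * 191 + [1] * 9    # gold labels
y_pred = [-1] * 200              # classifier always predicts the negative class

print(accuracy_score(y_true, y_pred))                    # 0.955
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 (no predicted positives)
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0 (TP is zero)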
Thank you greeness. Your answer helped me too.
To avoid this issue, make sure that the test and training datasets are chosen/grouped such that they have a fair mix of positive and negative values.
I'm new to WEKA and advanced statistics, and I'm starting from scratch to understand the WEKA measures. I've done all of @rushdi-shams' examples, which are great resources.
The Wikipedia article http://en.wikipedia.org/wiki/Precision_and_recall explains it with a simple example: video-recognition software detects 7 dogs in a scene containing 9 real dogs and some cats.
I perfectly understand the example, and the recall calculation.
So as a first step, let's see how to reproduce this example with data in Weka.
How do I create such a .ARFF file?
With the file below I get a wrong confusion matrix and wrong Accuracy By Class:
Recall is not 1, it should be 4/9 (0.4444)
@relation 'dogs and cat detection'
@attribute 'realanimal' {dog,cat}
@attribute 'detected' {dog,cat}
@attribute 'class' {correct,wrong}
@data
dog,dog,correct
dog,dog,correct
dog,dog,correct
dog,dog,correct
cat,dog,wrong
cat,dog,wrong
cat,dog,wrong
dog,?,?
dog,?,?
dog,?,?
dog,?,?
dog,?,?
cat,?,?
cat,?,?
Weka output (without filters):
=== Run information ===
Scheme:weka.classifiers.rules.ZeroR
Relation: dogs and cat detection
Instances: 14
Attributes: 3
realanimal
detected
class
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
ZeroR predicts class value: correct
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 4 57.1429 %
Incorrectly Classified Instances 3 42.8571 %
Kappa statistic 0
Mean absolute error 0.5
Root mean squared error 0.5044
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 7
Ignored Class Unknown Instances 7
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 1 0.571 1 0.727 0.65 correct
0 0 0 0 0 0.136 wrong
Weighted Avg. 0.571 0.571 0.327 0.571 0.416 0.43
=== Confusion Matrix ===
a b <-- classified as
4 0 | a = correct
3 0 | b = wrong
There must be something wrong with the False Negative dogs,
or is my ARFF approach totally wrong and do I need another kind of attributes?
Thanks
Let's start with the basic definitions of Precision and Recall.
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
Where TP is True Positive, FP is False Positive, and FN is False Negative.
In the above dog.arff file, Weka took into account only the first 7 tuples; it ignored the remaining 7. You can see from the output above that it classified all 7 of those tuples as correct (4 correct tuples + 3 wrong tuples).
Let's calculate the precision and recall for the correct and wrong classes.
First for the correct class:
Prec = 4/(4+3) = 0.571428571
Recall = 4/(4+0) = 1
For the wrong class:
Prec = 0/(0+0) = 0
Recall = 0/(0+3) = 0
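For what it's worth, the same numbers fall out of the confusion matrix in a few lines of Python (a sketch, assuming scikit-learn is available; zero_division=0 mirrors Weka's choice of reporting 0 for the 0/0 precision of the wrong class):
# Sketch: per-class precision/recall for the 7 scored tuples, where ZeroR
# predicted "correct" for every one of them.
from sklearn.metrics import precision_score, recall_score

y_true = ["correct"] * 4 + ["wrong"] * 3   # gold labels of the 7 scored tuples
y_pred = ["correct"] * 7                   # ZeroR always predicts the majority class

for cls in ("correct", "wrong"):
    p = precision_score(y_true, y_pred, pos_label=cls, zero_division=0)
    r = recall_score(y_true, y_pred, pos_label=cls, zero_division=0)
    print(cls, round(p, 3), round(r, 3))
# correct 0.571 1.0
# wrong 0.0 0.0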
I am trying to compare means of the two groups 'single mothers with one child' and 'single mothers with more than one child' before and after the reform of the EITC system in 1993.
Through the procedure T-test in SPSS, I can get the difference between groups before and after the reform. But how do I get the difference of the difference (I still want standard errors)?
I found these methods for STATA and R (http://thetarzan.wordpress.com/2011/06/20/differences-in-differences-estimation-in-r-and-stata/), but I can't seem to figure it out in SPSS.
Hope someone will be able to help.
All the best,
Anne
This can be done with the GENLIN procedure. Here's some random data I generated to show how:
data list list /after oneChild value.
begin data.
0 1 12
0 1 12
0 1 11
0 1 13
0 1 11
1 1 10
1 1 9
1 1 8
1 1 9
1 1 7
0 0 16
0 0 16
0 0 18
0 0 15
0 0 17
1 0 6
1 0 6
1 0 5
1 0 5
1 0 4
end data.
dataset name exampleData WINDOW=front.
EXECUTE.
value labels after 0 'before' 1 'after'.
value labels oneChild 0 '>1 child' 1 '1 child'.
The means for the groups (in order, before I truncated to integers) are 17, 6, 12, and 9, respectively. So our GENLIN procedure should produce values of -11 (the after-before difference in the >1 child group), -5 (the before difference of 1 child minus >1 child), and 8 (the difference between the two groups' after-before differences).
To graph the data, just so you can see what we're expecting:
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=after value oneChild MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: after=col(source(s), name("after"), unit.category())
DATA: value=col(source(s), name("value"))
DATA: oneChild=col(source(s), name("oneChild"), unit.category())
GUIDE: axis(dim(2), label("value"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label(""))
SCALE: linear(dim(2), include(0))
ELEMENT: line(position(smooth.linear(after*value)), color.interior(oneChild))
ELEMENT: point.dodge.symmetric(position(after*value), color.interior(oneChild))
END GPL.
Now, for the GENLIN:
* Generalized Linear Models.
GENLIN value BY after oneChild (ORDER=DESCENDING)
/MODEL after oneChild after*oneChild INTERCEPT=YES
DISTRIBUTION=NORMAL LINK=IDENTITY
/CRITERIA SCALE=MLE COVB=MODEL PCONVERGE=1E-006(ABSOLUTE) SINGULAR=1E-012 ANALYSISTYPE=3(WALD)
CILEVEL=95 CITYPE=WALD LIKELIHOOD=FULL
/MISSING CLASSMISSING=EXCLUDE
/PRINT CPS DESCRIPTIVES MODELINFO FIT SUMMARY SOLUTION.
The results table shows just what we expect.
The >1 child group is between 10.1 and 12.3 lower after vs. before; this 95% CI contains the "real" value of 11.
The before difference between the >1 child and 1 child groups is between 3.5 and 5.7, containing the real value of 5.
The difference-of-differences is between 6.4 and 9.6, containing the real value of (17-6) - (12-9) = 8.
Std. errors, p values, and the other hypothesis testing values are all reported as well. Hope that helps.
EDIT: this can be done with less "complicated" syntax by computing the interaction term yourself and doing simple linear regression:
compute interaction = after*onechild.
execute.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT value
/METHOD=ENTER after oneChild interaction.
Note that the resulting standard errors and confidence intervals are actually different from the previous method. I don't know enough about SPSS's GENLIN and REGRESSION procedures to tell you why that's the case. In this contrived example, the conclusion you'd draw from your data would be approximately the same. In real life, the data aren't likely to be this clean, so I don't know which method is "better".
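If you ever want to sanity-check the interaction estimate outside SPSS, the same regression can be fit in Python with statsmodels (a sketch only; the DataFrame below simply re-enters the toy data from the DATA LIST above):
# Sketch: difference-in-differences as an OLS regression with an interaction
# term, mirroring the REGRESSION syntax above. Assumes pandas and statsmodels
# are installed; the values are the same toy data entered earlier.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "after":    [0]*5 + [1]*5 + [0]*5 + [1]*5,
    "oneChild": [1]*10 + [0]*10,
    "value":    [12, 12, 11, 13, 11, 10, 9, 8, 9, 7,
                 16, 16, 18, 15, 17, 6, 6, 5, 5, 4],
})

# The coefficient on after:oneChild is the difference-of-differences (about 8 here).
fit = smf.ols("value ~ after * oneChild", data=df).fit()
print(fit.summary())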
I take the general linear model to be an 'ANOVA' model, so use the corresponding module in SPSS's Analyze menu.
After the t-test, you need to check the equality of variances (sigma) across the groups.
Regarding the first answer above:
* Note that GENLIN uses maximum likelihood estimation (MLE) whereas REGRESSION
* uses ordinary least squares (OLS). Therefore, GENLIN reports z- and Chi-square tests
* where REGRESSION reports t- and F-tests. Rather than using GENLIN, use UNIANOVA
* to get the same results as REGRESSION, but without the need to compute your own
* product term.
UNIANOVA value BY after oneChild
/PLOT=PROFILE(after*oneChild)
/PLOT=PROFILE(oneChild*after)
/PRINT PARAMETER
/EMMEANS=TABLES(after*oneChild) COMPARE(after)
/EMMEANS=TABLES(after*oneChild) COMPARE(oneChild)
/DESIGN=after oneChild after*oneChild.
HTH.
I am trying to get some data from a module that is a shared object wrapped with ctypes. The data is a numeric array, so I used a numpy array to store it. But I learned that I don't understand how numpy organizes the array in memory.
If I had a C function that fills an array like the one below:
int filler(int* a, int length) {
    int i = 0;
    for (i = 0; i < length; i++) {
        a[i] = i;
    }
    return 0;
}
Then I would call this function in Python using ctypes:
import ctypes
import numpy
lib = ctypes.cdll.LoadLibrary("libname")
data = numpy.zeros((1,10),dtype=numpy.int16)
lib.filler(data.ctypes.data,ctypes.c_int(10))
print data
But my output comes out this way.
dtype=numpy.int16
[[0 0 1 0 2 0 3 0 4 0]]
This would make sense if int were 32-bit, but I suppose C ints are 16 bits (GCC on openSUSE on an x86 Intel machine).
I tried running with dtypes being 32 bits and strangely I get the result I want:
dtype=numpy.int32
[[0 1 2 3 4 5 6 7 8 9]]
Trying to make sense of what is happening I ran with int8 and I got the following:
dtype=numpy.int8
[[0 0 0 0 1 0 0 0 2 0]]
I did take a look at the numpy docs, but so far I have not found the answer.
This would make sense if int were 32-bit, but I suppose C ints are 16 bits (GCC on openSUSE on an x86 Intel machine). I tried running with dtypes being 32 bits and strangely I get the result I want:
Not strange at all: your supposition is wrong, and your machine is 32-bit with a 32-bit int and a 16-bit short int... unless you're doing some (rather admirable) retrocomputing!
Check sizeof(int) and multiply by 8, or simply store numbers in an int and print them out, to convince yourself.
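You can also check this from Python itself. Here is a small sketch; it reuses the hypothetical library name "libname" from the question and switches the array dtype to numpy.intc, which is defined to match the platform's C int:
# Sketch: confirm the C int width from Python and allocate the numpy buffer
# with a dtype that matches it, so filler() and numpy agree on the layout.
import ctypes
import numpy

print(ctypes.sizeof(ctypes.c_int) * 8)           # prints 32 on a typical x86 build

lib = ctypes.cdll.LoadLibrary("libname")         # same hypothetical shared object as above
data = numpy.zeros((1, 10), dtype=numpy.intc)    # numpy.intc matches C int on this platform
lib.filler(data.ctypes.data_as(ctypes.POINTER(ctypes.c_int)), ctypes.c_int(10))
print(data)                                      # expected: [[0 1 2 3 4 5 6 7 8 9]]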