Data visualization: 3D, precision, recall, and F-measure. Maybe using Octave? - machine-learning

I've been running a machine learning algorithm, and I have output in the form of precision, recall, and F-measure.
I'd like to graph this data so I can get a clearer picture of how things are really going, but I don't really know how to do that. I suppose I can use Octave? I heard about it in the Andrew Ng course and I've already got it on my machine, but I don't really know how to use it to visualize data.
Does anyone with experience in this know how I might best proceed, or have some helpful resources on the best way to go about this?
0.011723329425556858 P 0.6000000238418579 R 0.010416666977107525 F1 0.02047781631341665
0.012895662368112544 P 0.6363636255264282 R 0.01215277798473835 F1 0.023850085569817648
0.01406799531066823 P 0.6666666865348816 R 0.013888888992369175 F1 0.027210884568890845
0.015240328253223915 P 0.6153846383094788 R 0.013888888992369175 F1 0.02716468612858015
0.016412661195779603 P 0.6428571343421936 R 0.015625 F1 0.03050847456668239
0.017584994138335287 P 0.6000000238418579 R 0.015625 F1 0.03045685282259509
0.01875732708089097 P 0.5625 R 0.015625 F1 0.030405405405405407
0.01992966002344666 P 0.529411792755127 R 0.015625 F1 0.030354131580674088
0.021101992966002344 P 0.5555555820465088 R 0.0173611119389534 F1 0.03367003527554599
0.022274325908558032 P 0.5263158082962036 R 0.0173611119389534 F1 0.03361344696816966
0.023446658851113716 P 0.5 R 0.0173611119389534 F1 0.033557048526295
0.0246189917936694 P 0.4761904776096344 R 0.0173611119389534 F1 0.03350083906570289

I suppose the first column is some threshold you varied between lines.
A precision-recall graph plots precision against recall, so first extract those two columns from your data (suppose your data are saved in prf.data):
awk '{print $3, $5}' prf.data
This prints just those two columns, which you can then use to initialize a 2-D matrix in Octave:
data = [
0.6000000238418579 0.010416666977107525
0.6363636255264282 0.01215277798473835
0.6666666865348816 0.013888888992369175
0.6153846383094788 0.013888888992369175
0.6428571343421936 0.015625
0.6000000238418579 0.015625
0.5625 0.015625
0.529411792755127 0.015625
0.5555555820465088 0.0173611119389534
0.5263158082962036 0.0173611119389534
0.5 0.0173611119389534
0.4761904776096344 0.0173611119389534];
Then, in Octave, the following commands plot each row as a data point:
plot(data(:,2), data(:,1), 'x')
ylabel('precision')
xlabel('recall')
It looks like, as the threshold increases, precision decreases while recall stays the same (for example, at thresholds 0.021, 0.022, 0.023, and 0.024).
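If you'd rather not build the matrix by hand, here is a sketch that reads prf.data directly in Octave, assuming every line has the fixed layout shown above ("<threshold> P <precision> R <recall> F1 <f1>"):
% Sketch: load precision/recall straight from prf.data
fid = fopen('prf.data', 'r');
C = textscan(fid, '%f P %f R %f F1 %f');
fclose(fid);
precision = C{2};
recall = C{3};
plot(recall, precision, 'x')
ylabel('precision')
xlabel('recall')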

Related

Use the Survey package to weight observations in stacked imputations

I am exploring model variable selection within imputed data.
One technique is to stack the imputations in long format (where n observations in M imputed datasets create a dataset n x M rows long) and use weighted regression to reduce the contribution of each observation in proportion to the number of imputations. If we treated the stacked dataset as one large dataset, the standard errors would be too small.
I am trying to use the weights argument in svyglm to account for the stacked data, resulting in SEs that you would expect with n observations, rather than n x M observations.
To illustrate:
library(mice)
### create data
set.seed(42)
n <- 50
id <- 1:n
var1 <- rbinom(n,1,0.4)
var2 <- runif(n,30,80)
var3 <- rnorm(n, mean = 12, sd = 5)
var4 <- rnorm(n, mean = 100, sd = 20)
prob <- (((var1*var2)+var3)-min((var1*var2)+var3)) / (max((var1*var2)+var3)-min((var1*var2)+var3))
outcome <- rbinom(n, 1, prob = prob)
data <- data.frame(id, var1, var2, var3, var4, outcome)
### Add missingness
data_miss <- ampute(data)
patt <- data_miss$patterns
patt <- patt[2:5,]
data_miss <- ampute(data, patterns = patt)
data_miss <- data_miss$amp
## create 5 imputed datasets
nimp <- 5
imp <- mice(data_miss, m = nimp)
## Stack data
data_long <- complete(imp, action = "long")
## Generate model in stacked data (SEs will be too small)
modlong <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = data_long)
summary(modlong)
The long data gives overly small SEs, as we've increased the size of our dataset by 5x:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.906417 0.965090 -3.012 0.0026 **
var1 2.221053 0.311167 7.138 9.48e-13 ***
var2 -0.002543 0.010468 -0.243 0.8081
var3 0.076955 0.032265 2.385 0.0171 *
var4 0.006595 0.008031 0.821 0.4115
Add weights
data_long$weight <- 1/nimp
library(survey)
des <- svydesign(ids = ~1, data = data_long, weights = ~weight)
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des)
summary(mod_svy)
The weighted regression gives SEs similar to the unweighted model:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036691 -2.804 0.00546 **
var1 2.221053 0.310906 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
Adding rescale = F (which apparently stops the weights from being rescaled to sum to the sample size) doesn't change anything:
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des, rescale = F)
summary(mod_svy)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036688 -2.804 0.00546 **
var1 2.221053 0.310905 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
I would have expected SEs similar to those obtained when running a model on a single imputed dataset:
## Assess SEs in single imputation
mod_singleimp <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = complete(imp,1))
summary(mod_singleimp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.679589 2.116806 -1.266 0.20556
var1 2.476193 0.761195 3.253 0.00114 **
var2 0.014823 0.025350 0.585 0.55874
var3 0.048940 0.072752 0.673 0.50114
var4 -0.004551 0.017986 -0.253 0.80026
All assistance greatly appreciated, as are suggestions for other ways of achieving the same goal.
Alternative options
The psfmi package allows for stepwise selection in multiply imputed datasets and pooling of models. However, it is computationally intensive and slow with large datasets, particularly if the process needs to be bootstrapped (e.g. during internal validation), hence the requirement for a less intensive stacking approach.
Sorry, no, this isn't going to work.
To handle stacked imputation data with weights you need frequency weights, so that a weight of 1/10 means you have 1/10 of an observation. With svydesign you specify sampling weights, so that a weight of 1/10 means your observation represents 10 observations in the population. These will (and should) give different standard errors. Pretending you have frequency weights when you actually have imputations is a clever hack to avoid having software that understands what it's doing, which is fine but isn't compatible with survey, which understands what it's doing and is doing something different.
Currently, if you want to use svyglm with multiple imputations you need to compute the standard errors separately -- most conveniently with Rubin's rules using mitools::MIcombine, which is set up to work with the survey package (see the help for with.svyimputationList and withPV).
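For completeness, a minimal sketch of that route, reusing the imp and nimp objects from your question (mice's own pool() applies the same Rubin's rules):
library(mitools)
## one completed dataset per imputation
implist <- imputationList(lapply(seq_len(nimp), function(i) complete(imp, i)))
## fit the model in each completed dataset
fits <- with(implist, glm(outcome ~ var1 + var2 + var3 + var4, family = binomial()))
## pool with Rubin's rules
summary(MIcombine(fits))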
It might be worth putting in a feature request to the mitools or survey developers (with citations to examples) to allow for stacked analysis of imputations, but this isn't just a matter of adjusting the weights.

Subgraph isomorphism (or even set membership) in Z3?

I'm trying to find a way to encode a sort of basic subgraph isomorphism in Z3 (preferably z3py). While I know there are papers on this in the abstract, finding any mechanism to do it has eluded me even for very trivial cases, because I'm very new to Z3 in general!
Suppose you have just about the most basic subgraph with nodes (0,1,2) and edges (0,1) with node 2 off on its own, and the supergraph has nodes (0,1,2) and edges (1,2) with node 0 off on its own. You could map the nodes of the subgraph into the supergraph with
0->1,
1->2,
2->0
...as one possible mapping that would satisfy "if these two nodes are connected in the subgraph, their mapped nodes are connected in the supergraph"
So okay :) I tried
from z3 import *
from networkx import Graph
from networkx.linalg.graphmatrix import adjacency_matrix
subgraph = Graph()
subgraph.add_nodes_from([0,1,2])
subgraph.add_edges_from([(0,1)])
supergraph = Graph()
supergraph.add_nodes_from([0,1,2])
supergraph.add_edges_from([(1,2)])
s = Solver()
assignments = [Int(f'n{node}') for node in subgraph.nodes]
# each bit assignment in the subgraph belongs to one in the supergraph
assignment_constraint = [ And(assignments[i] >= 0, assignments[i] <= max(supergraph.nodes)) for i in subgraph.nodes ]
# subgraph bits can't be assigned to the same supergraph bits
assignment_distinct = [ Distinct([assignments[i] for i in subgraph.nodes])]
which just gets me as far as "each assignment from subgraph to supergraph should map a node in the subgraph to some node in the supergraph and no two subgraph nodes can be assigned to the same supergraph node"
...but then I get stuck because I keep thinking along the lines of
for edge in subgraph.edges:
    s.add( (assignments[edge[0]], assignments[edge[1]]) in supergraph.edges )
...but of course that doesn't work because pythonically those aren't the right sort of keys so that's always false or broken.
So how does one approach that? I can add constraints like "this_var == 1" but get very confused on things like checking membership, ie
>>> assignments[0] == 1.0
n0 == 1 # so that's OK then
>>> assignments[0] in [1.0, 2.0, 3.0]
False # woops, that fails horribly
and I feel like I'm missing a very basic "frame of mind" thing here.
It is relatively straightforward to encode subgraph isomorphism in z3, pretty much along the lines of how you described. However, this encoding is unlikely to scale to large graphs. As you no doubt know, subgraph isomorphism is NP-complete in general, and this encoding will cause z3 to simply enumerate all possibilities and thus will blow up exponentially.
Having said that, here's a straightforward encoding:
from z3 import *

# Subgraph, number of nodes and edges.
# Nodes will be named implicitly from 0 to noOfNodesA - 1
noOfNodesA = 3
edgesA = [(0, 1)]

# Supergraph:
noOfNodesB = 3
edgesB = [(1, 2)]

# Mapping of subgraph nodes to supergraph nodes:
mapping = Array('Map', IntSort(), IntSort())

s = Solver()

# Check that elt is between low and high, inclusive
def InRange(elt, low, high):
    return And(low <= elt, elt <= high)

# Check that (x, y) is in the list
def Contains(x, y, lst):
    return Or([And(x == x1, y == y1) for x1, y1 in lst])

# Make sure mapping is into the supergraph
s.add(And([InRange(Select(mapping, n1), 0, noOfNodesB-1) for n1 in range(noOfNodesA)]))

# Make sure we map nodes to distinct nodes
s.add(Distinct([Select(mapping, n1) for n1 in range(noOfNodesA)]))

# Make sure edges are preserved:
for x, y in edgesA:
    s.add(Contains(Select(mapping, x), Select(mapping, y), edgesB))

# Solve:
r = s.check()
if r == sat:
    m = s.model()
    for x in range(noOfNodesA):
        print("%s -> %s" % (x, m.evaluate(Select(mapping, x))))
else:
    print("Solver said: %s" % r)
I've added comments along the way, so hopefully you should be able to read the code through; feel free to ask specific questions.
When I run this, I get:
$ python a.py
0 -> 1
1 -> 2
2 -> 0
which finds exactly the mapping you alluded to in your question.
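One detail worth noting if you adapt this further: Contains only matches an edge in the orientation in which it is listed in edgesB. Since your graphs are undirected, you may want a variant that accepts either orientation (my own addition; the tiny example above happens not to need it):
# Check that (x, y) is in the list, in either orientation
def ContainsUndirected(x, y, lst):
    return Or([Or(And(x == x1, y == y1),
                  And(x == y1, y == x1)) for x1, y1 in lst])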
Best of luck!

simple way to tell if MST will improve if a specific edge cost is reduced?

G is an undirected connected graph with positive costs on all edges. We are given an edge e whose cost is strictly more than 10. We need to answer whether the MST cost will improve if the cost of e is reduced by 10.
I know of a solution that involves generating a new graph with only the edges of cost < cost(e) - 10. What's wrong with this much simpler solution:
Take one of e's vertices v. Find the minimal-cost edge incident to v. Now reduce e's cost and find the minimal-cost edge incident to v again. If there was a change, it means that Prim would find a better MST and the cost improves. If not, it means that Prim would find the same MST and the cost stays the same.
What's wrong with this logic?
Related: Update minimum spanning tree with modification of edge
I don't think that your solution is correct.
Consider the following graph G = (V, E), with V = {a, b, c, d, e}, E = {ab, bc, cd, de, bd}, and respective weights {5, 10, 10, 5, 17}.
By running Kruskal or Prim, we find that our MST is {ab, bc, cd, de}, and its weight is 30.
Now, let's reduce the weight of the edge bd from 17 to 7, and examine the edges again.
Running Prim or Kruskal on the modified graph G' will output an MST which weighs 27 (actually we have 2 such MSTs: {ab, bd, de, cd} and {ab, bd, de, bc}).
But if we use your algorithm, we would get the exact same tree, because when we examine the nodes b or d, the edge bd is not the lightest edge adjacent to either of these nodes.
Let G = (V, E) be a graph.
Definition
C(v) = min{ w(<u, v>) : <u, v> in E }, where w(<u, v>) is the weight of <u, v>; that is, C(v) is the minimum weight among the edges incident to v.
Lemma 1
Let G be a graph, v a vertex of G, and e an edge of G incident to v. If w(e) = C(v), then e belongs to some MST of G.
By Lemma 1, it's true that if the value of C(v) changes when e's cost is reduced by 10, then the MST cost will improve.
The first half is OK. Let's take a look at the second part.
If not, it means that Prim would find the same MST and the cost stays the same.
General explanation
The quote above falsely assumes that the converse of Lemma 1 is true (if e belongs to some MST of G, then w(e) = C(v)): it claims that if we reduce e's cost by 10 and w(e) != C(v), then the MST cost is preserved, which would imply that e doesn't belong to any MST of G.
Short explanation: a counterexample
Let G = ({1, 2, 3, 4}, {<1, 2>, <1, 3>, <2, 4>, <3, 4>, <1, 4>}) with weight function w(<1, 2>) = 1, w(<1, 3>) = 3, w(<2, 4>) = 3, w(<3, 4>) = 1, w(<1, 4>) = 12, and e = <1, 4>.
After reducing e's cost we have C(1) = C(4) = 1 != w(e). The proposed algorithm states that "Prim would find the same MST and the cost stays the same".
Let's check if there is a decrease in G's MST cost when the cost of e is reduced by 10:
MST cost before reducing the cost of e by 10: 5
MST cost after reducing the cost of e by 10: 4
Since the MST cost decreases, the quoted claim is false and the proposed algorithm doesn't work.
Note: the algorithm is wrong no matter which MST algorithm is used, as the counterexample relies only on MST properties.
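If you want to double-check the counterexample numerically, here is a small sketch using networkx (my own verification script, not part of the proof):
import networkx as nx

def mst_cost(weights):
    # Build the weighted graph and return the total weight of an MST.
    G = nx.Graph()
    for (u, v), w in weights.items():
        G.add_edge(u, v, weight=w)
    T = nx.minimum_spanning_tree(G)
    return sum(d["weight"] for _, _, d in T.edges(data=True))

w = {(1, 2): 1, (1, 3): 3, (2, 4): 3, (3, 4): 1, (1, 4): 12}
print(mst_cost(w))  # 5, before reducing the cost of e = <1, 4>
w[(1, 4)] -= 10
print(mst_cost(w))  # 4, after the reduction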

Logistic regression using Flux.jl

I have a dataset consisting of student marks in 2 subjects and the result of whether the student is admitted to college or not. I need to perform a logistic regression on the data to find the optimum parameter θ that minimizes the loss, and predict the results for the test data. I am not trying to build any complex non-linear network here.
I have the loss function for logistic regression defined like this, which works fine:
predict(X) = sigmoid(X*θ)
loss(X,y) = (1 / length(y)) * sum(-y .* log.(predict(X)) .- (1 - y) .* log.(1 - predict(X)))
I need to minimize this loss function and find the optimum θ. I want to do it with Flux.jl or any other library which makes it even easier.
I tried using Flux.jl after reading the examples, but I'm not able to minimize the cost.
My code snippet:
function update!(ps, η = .1)
    for w in ps
        w.data .-= w.grad .* η
        print(w.data)
        w.grad .= 0
    end
end

for i = 1:400
    back!(L)
    update!((θ, b))
    @show L
end
You can use either GLM.jl (simpler) or Flux.jl (more involved but more powerful in general).
In the code below I generate the data so that you can check whether the result is correct. Additionally, I assume a binary response variable; if your target variable has a different encoding you might need to change the code a bit.
Here is the code to run (you can tweak the parameters to increase the convergence speed; I chose ones that are safe):
using GLM, DataFrames, Flux.Tracker

srand(1)
n = 10000
df = DataFrame(s1=rand(n), s2=rand(n))
df[:y] = rand(n) .< 1 ./ (1 .+ exp.(-(1 .+ 2 .* df[1] .+ 0.5 .* df[2])))
model = glm(@formula(y~s1+s2), df, Binomial(), LogitLink())

x = Matrix(df[1:2])
y = df[3]

W = param(rand(2,1))
b = param(rand(1))

predict(x) = 1.0 ./ (1.0+exp.(-x*W .- b))
loss(x,y) = -sum(log.(predict(x[y,:]))) - sum(log.(1 - predict(x[.!y,:])))

function update!(ps, η = .0001)
    for w in ps
        w.data .-= w.grad .* η
        w.grad .= 0
    end
end

i = 1
while true
    back!(loss(x,y))
    max(maximum(abs.(W.grad)), abs(b.grad[1])) > 0.001 || break
    update!((W, b))
    i += 1
end
And here are the results:
julia> model # GLM result
StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ 1 + s1 + s2
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) 0.910347 0.0789283 11.5338 <1e-30
s1 2.18707 0.123487 17.7109 <1e-69
s2 0.556293 0.115052 4.83513 <1e-5
julia> (b, W, i) # Flux result with number of iterations needed to converge
(param([0.910362]), param([2.18705; 0.556278]), 1946)
Thanks for this helpful example. However, it does not seem to run with my setup (Julia 1.1, Flux 0.7.1), since the 1+ and 1- operations in the predict and loss functions are not broadcast over the TrackedArray objects. Fortunately the fix is simple (note the dots!):
predict(x) = 1.0 ./ (1.0 .+ exp.(-x*W .- b))
loss(x,y) = -sum(log.(predict(x[y,:]))) - sum(log.(1 .- predict(x[.!y,:])))
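On current versions the Tracker-based API (param, back!, .data/.grad) is gone; Flux now uses Zygote. A rough sketch of the same fit in the implicit-parameters style of Flux 0.10-0.13 (my own adaptation, with the data generated as in the answer):
using Flux, Random

Random.seed!(1)
n = 10_000
x = rand(n, 2)
y = rand(n) .< 1 ./ (1 .+ exp.(-(1 .+ 2 .* x[:, 1] .+ 0.5 .* x[:, 2])))

W = rand(2, 1)
b = rand(1)
predict(x) = 1.0 ./ (1.0 .+ exp.(-x * W .- b))
loss(x, y) = -sum(log.(predict(x[y, :]))) - sum(log.(1 .- predict(x[.!y, :])))

for _ in 1:10_000
    # gradients with respect to W and b via Zygote's implicit parameters
    gs = Flux.gradient(() -> loss(x, y), Flux.params(W, b))
    W .-= 1e-4 .* gs[W]   # plain gradient-descent step, as in the answer
    b .-= 1e-4 .* gs[b]
end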

Stata: estadd weighted dependent variable mean (ysumm)

I want to add a row listing the weighted mean of the dependent variable at the bottom of a regression table. Normally, I would run:
reg y x1 x2 x3
estadd ysumm, mean
eststo r1
esttab r1 using results.tex, replace label title("Title") long nomtitles cells("b(fmt(a3) star)" t(par fmt(2))) stats(r2 N ymean, labels("R-squared" "Observations" "Mean of Y"))
However, I have tried two ways to get the weighted mean without success.
First:
reg y x1 x2 x3
estadd ysumm [aw=pop], mean
and I get the error:
weights not allowed
r(101);
Second, I manually enter the weighted means into a matrix and then save it with estadd:
matrix define wtmeans=(mean1, mean2, mean3)
estadd matrix wtmeans
esttab r1 using results.tex, replace label title("Title") long nomtitles cells("b(fmt(a3) star)" t(par fmt(2))) stats(r2 N wtmeans, labels("R-squared" "Observations" "Mean of Y"))
The resulting tex file includes the label "Mean of Y", but the row is blank.
How can I get those weighted means to appear in the tex table?
I had a similar problem to solve today. Part of the solution is to use estadd scalar and then refer to that scalar in esttab's stats() option.
Here's the syntax I am using for a similar problem. It may be slightly different for you since you're pulling a different scalar (I am grabbing p-values for a specific joint F-test), but in essence it should be the same:
eststo clear
eststo ALL: reg treatment var1 var2 var3 var4 if experiment
qui test var1 var2 var3
estadd scalar pvals=r(p)
...repeat for other specifications...
esttab _all using filename.csv, replace se r2 ar2 pr2 stat(pvals) star( + .1 ++ .05 +++ .01) b(%9.3f) se(%9.3f) drop(o.*) label indicate()
So you could do the following:
eststo clear
eststo r1: reg y x1 x2 x3
qui sum y [aw=pop]
estadd scalar YwtdMean=r(mean)
esttab r1 using results.tex, replace label title("Title") long nomtitles cells("b(fmt(a3) star)" t(par fmt(2))) stats(r2 N YwtdMean, labels("R-squared" "Observations" "Weighted Mean of Y"))
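If you end up reporting several specifications, the same pattern loops cleanly. A sketch with hypothetical regressor lists; note the if e(sample) restriction (my own addition), so the weighted mean is computed over the estimation sample:
eststo clear
foreach rhs in "x1" "x1 x2" "x1 x2 x3" {
    eststo: reg y `rhs'
    qui sum y [aw=pop] if e(sample)
    estadd scalar YwtdMean = r(mean)
}
esttab using results.tex, replace label stats(r2 N YwtdMean, labels("R-squared" "Observations" "Weighted Mean of Y"))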
Let me know if this works.
