I want to add a row for listing the weighted mean of the dependent variable at the bottom of a regression table. Normally, I would run
reg y x1 x2 x3
estadd ysumm, mean
eststo r1
esttab r1 using results.tex, replace label title("Title") long nomtitles cells("b(fmt(a3) star)" t(par fmt(2))) stats(r2 N ymean, labels("R-squared" "Observations" "Mean of Y"))
However, I have tried two ways to get the weighted mean without success.
First:
reg y x1 x2 x3
estadd ysumm [aw=pop], mean
and I get the error:
weights not allowed
r(101);
Second, I manually enter the weighted means into a matrix and then save it with estadd:
matrix define wtmeans=(mean1, mean2, mean3)
estadd matrix wtmeans
esttab r1 using results.tex, replace label title("Title") long nomtitles cells("b(fmt(a3) star)" t(par fmt(2))) stats(r2 N wtmeans, labels("R-squared" "Observations" "Mean of Y"))
The resulting tex file includes the label "Mean of Y", but the row is blank.
How can I get those weighted means to appear in the tex table?
I had a similar problem to solve today. Part of the solution is to use estadd scalar and then refer to that stored scalar in esttab's stats() option.
Here's the syntax I am using for a similar problem. It may be slightly different for you since you're pulling a different scalar (I am grabbing p-values for a specific joint F-test), but in essence it should be the same:
eststo clear
eststo ALL: reg treatment var1 var2 var3 var4 if experiment
qui test var1 var2 var3
estadd scalar pvals=r(p)
...repeat for other specifications...
esttab _all using filename.csv, replace se r2 ar2 pr2 stat(pvals) star( + .1 ++ .05 +++ .01) b(%9.3f) se(%9.3f) drop(o.*) label indicate()
So you could do the following:
eststo clear
eststo r1: reg y x1 x2 x3
qui sum y [aw=pop]
estadd scalar YwtdMean=r(mean)
esttab r1 using results.tex, replace label title("Title") long nomtitles cells("b(fmt(a3) star)" t(par fmt(2))) stats(r2 N YwtdMean, labels("R-squared" "Observations" "Weighted Mean of Y"))
Let me know if this works.
Related
I am exploring model variable selection within imputed data.
One technique is to stack the imputations in long format (so that n observations in M imputed datasets become a dataset of n x M rows) and use weighted regression to reduce each observation's contribution in proportion to the number of imputations. If we treated the stacked dataset as one large dataset, the standard errors would be too small.
I am trying to use the weights argument in svyglm to account for the stacking, so that the SEs are those you would expect with n observations rather than n x M observations.
To illustrate:
library(mice)
### create data
set.seed(42)
n <- 50
id <- 1:n
var1 <- rbinom(n,1,0.4)
var2 <- runif(n,30,80)
var3 <- rnorm(n, mean = 12, sd = 5)
var4 <- rnorm(n, mean = 100, sd = 20)
prob <- (((var1*var2)+var3)-min((var1*var2)+var3)) / (max((var1*var2)+var3)-min((var1*var2)+var3))
outcome <- rbinom(n, 1, prob = prob)
data <- data.frame(id, var1, var2, var3, var4, outcome)
### Add missingness
data_miss <- ampute(data)
patt <- data_miss$patterns
patt <- patt[2:5,]
data_miss <- ampute(data, patterns = patt)
data_miss <- data_miss$amp
## create 5 imputed datasets
nimp <- 5
imp <- mice(data_miss, m = nimp)
## Stack data
data_long <- complete(imp, action = "long")
## Generate model in stacked data (SEs will be too small)
modlong <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = data_long)
summary(modlong)
The long data gives overly small SEs, as we've increased the size of our dataset by 5x:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.906417 0.965090 -3.012 0.0026 **
var1 2.221053 0.311167 7.138 9.48e-13 ***
var2 -0.002543 0.010468 -0.243 0.8081
var3 0.076955 0.032265 2.385 0.0171 *
var4 0.006595 0.008031 0.821 0.4115
Add weights
data_long$weight <- 1/nimp
library(survey)
des <- svydesign(ids = ~1, data = data_long, weights = ~weight)
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des)
summary(mod_svy)
The weighted regression gives similar SEs to the unweighted model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036691 -2.804 0.00546 **
var1 2.221053 0.310906 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
Adding rescale = F (which apparently stops the weights from being rescaled to sum to the sample size) doesn't change anything:
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des, rescale = F)
summary(mod_svy)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036688 -2.804 0.00546 **
var1 2.221053 0.310905 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
I would have expected SEs similar to those obtained when running a model in a single imputed dataset
## Assess SEs in single imputation
mod_singleimp <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = complete(imp,1))
summary(mod_singleimp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.679589 2.116806 -1.266 0.20556
var1 2.476193 0.761195 3.253 0.00114 **
var2 0.014823 0.025350 0.585 0.55874
var3 0.048940 0.072752 0.673 0.50114
var4 -0.004551 0.017986 -0.253 0.80026
All assistance is greatly appreciated, as are pointers to other ways of achieving the same goal.
Alternative options
The psfmi package allows stepwise selection in multiply imputed datasets and pooling of models. However, it is computationally intensive and slow with large datasets, particularly if the process needs to be bootstrapped (e.g., during internal validation), hence the desire for a less intensive stacking approach.
Sorry, no, this isn't going to work.
To handle stacked imputation data with weights you need frequency weights, so that a weight of 1/10 means you have 1/10 of an observation. With svydesign you specify sampling weights, so that a weight of 1/10 means your observation represents 10 observations in the population. These will (and should) give different standard errors. Pretending you have frequency weights when you actually have imputations is a clever hack to avoid having software that understands what it's doing, which is fine but isn't compatible with survey, which understands what it's doing and is doing something different.
Currently, if you want to use svyglm with multiple imputations you need to compute the standard errors separately -- most conveniently with Rubin's rules using mitools::MIcombine, which is set up to work with the survey package (see the help for with.svyimputationList and withPV).
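For the data in the question, a minimal sketch of that workflow might look like the following (this reuses imp and nimp from the post above and follows the pattern in the mitools/survey documentation as I understand it: the model is refit in each completed dataset and the fits are pooled with Rubin's rules, rather than stacking):
library(mice)
library(survey)
library(mitools)
## One completed dataset per imputation, wrapped as an imputation list
imp_datasets <- lapply(seq_len(nimp), function(i) complete(imp, action = i))
des_list <- svydesign(ids = ~1, data = imputationList(imp_datasets))
## Fit the model in each imputed dataset, then pool with Rubin's rules
fits <- with(des_list, svyglm(outcome ~ var1 + var2 + var3 + var4,
                              family = quasibinomial()))
summary(MIcombine(fits))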
It might be worth putting in a feature request to the mitools or survey developers (with citations to examples) to allow for stacked analysis of imputations, but this isn't just a matter of adjusting the weights.
I am using the community-contributed command estout to output a customized table from Stata to a latex .tex file. However, I do not know how I can add multiple columns in one table.
Below is a simplified example where I create two separate tables, each containing the standard deviations of the residuals from two different regressions:
reg y x1
predict res1, residual
reg y x2
predict res2, residual
reg y x3
predict res3, residual
reg y x4
predict res4, residual
eststo clear
estpost summarize res1 res2
eststo
esttab, cells("sd") noobs nonum
esttab using first.tex, cells("sd") noobs nonum replace
eststo clear
estpost summarize res3 res4
eststo
esttab, cells("sd") noobs nonum
esttab using second.tex, cells("sd") noobs nonum replace
However, I would like to have the two columns in the same table as follows:
sd(res1) sd(res3)
sd(res2) sd(res4)
Is Stata 14 capable of customizing a table like this?
This question differs from my earlier question: there I was looking for the command that creates customized tables, and the answer was estpost. Now I am asking how to customize that command's output in a way I could not find in its documentation.
You need to create a matrix with the results and then configure estout's options accordingly:
sysuse auto, clear
regress price mpg
predict res1, residual
regress price length
predict res2, residual
regress price displacement
predict res3, residual
regress price headroom
predict res4, residual
matrix A = J(2, 2, 0)
local j = 0
forvalues i = 1 / 4 {
    summarize res`i'
    if `i' <= 2 matrix A[`i', 1] = r(sd)
    else {
        local ++j
        matrix A[`j', 2] = r(sd)
    }
}
esttab matrix(A), mlabels(sd) collabels(none) coeflabels(none)
--------------------------------------
sd
--------------------------------------
2605.621 2562.891
2660.311 2930.096
--------------------------------------
I understand the perceptron well, so please focus only on the kernel. I am not familiar with mathematical notation, so please give me a numerical example and a guide to the kernel.
For example:
My perceptron hyperplane is x1*w1 + x2*w2 + x3*w3 + b = 0. The RBF kernel formula is k(x,z) = exp(-|x-z|^2 / (2*variance^2)), which is where the radial basis function kernel comes in. Is x an input, and what is the z variable here?
And if it really is a variance in the formula, what do I calculate the variance of?
I have understood that I somehow have to plug this formula into the perceptron decision function x1*w1 + x2*w2 + x3*w3 + b = 0, but what does that look like if I do?
I would like a numerical example to avoid confusion.
Linear Perceptron
As you know, linear perceptrons can be trained for binary classification. More precisely, if there are n features, x1, x2, ..., xn, in n-dimensional space Rn, and you want to label points in 2 categories, y1 & y2 (usually -1 and +1), you can use a linear perceptron, which defines a hyperplane w1*x1 + ... + wn*xn + b = 0 to do so.
w1*x1 + ... + wn*xn + b > 0 or W.X + b > 0 ==> class = y1
w1*x1 + ... + wn*xn + b < 0 or W.X + b < 0 ==> class = y2
A linear perceptron will work well only if the problem is linearly separable in Rn. For example, in 2D space, this means that a single line can separate the two sets of points.
Algorithm
One common algorithm to train the perceptron, i.e., to find the weights and bias, w's & b, based on N data points X1, ..., XN and their labels Y1, ..., YN, is the following:
Initialize: W = zeros(n,1); b = 0
For i = 1 to N:
    Calculate F(Xi) = W.Xi + b
    If F(Xi)*Yi <= 0:
        W <--- W + Xi*Yi
        b <--- b + Yi
This will give the final values for W & b. Moreover, as a result of the training, W will be a linear combination of the training points Xi, more precisely, the ones that were misclassified. So W = a1*X1 + ... + aN*XN, where the a's are in {0, y1, y2}.
Now, if there is a new point, let's say Z, to label, we check the sign of F(Z) = W.Z + b = a1*(X1.Z) + ... + aN*(XN.Z) + b. It is interesting that only the inner products of the new point with the training points take part in it.
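To make this concrete, here is a minimal sketch of that single pass in R (my own illustration; the toy data and names are made up, and in practice you would repeat the pass until no point is misclassified):
## Minimal single-pass linear perceptron (illustrative sketch)
train_perceptron <- function(X, y) {
  W <- numeric(ncol(X)); b <- 0
  for (i in seq_len(nrow(X))) {
    Fi <- sum(W * X[i, ]) + b
    if (Fi * y[i] <= 0) {          # misclassified (or on the boundary)
      W <- W + X[i, ] * y[i]
      b <- b + y[i]
    }
  }
  list(W = W, b = b)
}
## Toy linearly separable data: label by the sign of x1 + x2
set.seed(1)
X <- matrix(runif(100, -1, 1), ncol = 2)
y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)
fit <- train_perceptron(X, y)
sign(sum(fit$W * c(0.5, 0.5)) + fit$b)   # classify a new point Z = (0.5, 0.5)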
Kernel Perceptron
Now, if the problem is not linearly separable, one may try to go to a higher-dimensional space in which a hyperplane can do the classification. As an example, consider a circle in 2D space. The points inside and outside of the circle can't be separated by a line. However, if you find a transformation that takes the points to 3D space such that the first 2 coordinates remain the same for all points, and the 3rd coordinate becomes +1 for points inside the circle and -1 for points outside, then the plane defined by 3rd coordinate = 0 can separate the points.
Finding such transformations can be difficult and computationally heavy, so the kernel trick is introduced. Notice that we only used the inner products of new points with the training points. The kernel trick exploits this fact and defines the inner product of the transformed points without actually finding the transformation.
If the unknown transformation is P(X), then the kernel function is:
K(Xi,Xj) = <P(Xi),P(Xj)>. So instead of finding P, kernel functions are defined that directly give the scalar result of the inner product in the high-dimensional space. There are also theorems about which functions can be kernel functions, i.e., correspond to an inner product in some other space.
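A small worked example of this idea (my own illustration, separate from the RBF kernel below): in 2D, the polynomial kernel K(X,Z) = (X.Z)^2 corresponds to the explicit map P(x1,x2) = (x1^2, sqrt(2)*x1*x2, x2^2), since (x1*z1 + x2*z2)^2 = x1^2*z1^2 + 2*x1*x2*z1*z2 + x2^2*z2^2 = <P(X),P(Z)>. A quick numerical check in R:
## Check numerically that (X.Z)^2 equals <P(X), P(Z)> for P(x1,x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
X <- c(1, 2); Z <- c(3, -1)
sum(X * Z)^2                                           # kernel evaluated in 2D: (1*3 + 2*(-1))^2 = 1
P <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)
sum(P(X) * P(Z))                                       # explicit inner product in 3D: also 1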
After choosing a kernel function, the algorithm will be modified as follows:
Initialize: F(X) = 0
For i = 1 to N:
    Calculate F(Xi)
    If F(Xi)*Yi <= 0:
        F(.) <--- F(.) + K(.,Xi)*Yi + Yi
At the end, F(.) = a1*K(.,X1) + ... + aN*K(.,XN) + b, where the a's are in {0, y1, y2}.
RBF Kernel
The radial basis function is one type of kernel function; it actually computes the inner product in an infinite-dimensional space. It can be written as
K(Xi,Xj) = exp(- norm2(Xi-Xj)^2 / (2*sigma^2))
Sigma is a parameter that you can tune to find an optimum value. For example, you can train the model with different values of sigma and then pick the best one based on performance. You can start with sigma = 1.
After training the model to find F(.), for a new data point Z, the sign of F(Z) = a1*K(Z,X1) + ... + aN*K(Z,XN) + b will determine the class.
Remarks:
Regarding your question about the variance: you don't need to compute any variance.
About x and z in your question: in each iteration, you should compute the kernel between the current data point and all the previously added points (the points that were misclassified and hence added to F).
I couldn't come up with a simple instructive numerical example.
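As a substitute for a hand-worked example, here is a minimal code sketch of a kernel perceptron with an RBF kernel in R (my own illustration; the circle data, sigma value, and number of passes are arbitrary choices, not part of the algorithm above):
## RBF kernel between two points
rbf_kernel <- function(xi, xj, sigma = 1) exp(-sum((xi - xj)^2) / (2 * sigma^2))
## Kernel perceptron: a[i] accumulates Yi each time point i is misclassified, b is the bias
kernel_perceptron <- function(X, y, sigma = 1, passes = 10) {
  N <- nrow(X)
  a <- numeric(N); b <- 0
  for (p in seq_len(passes)) {
    for (i in seq_len(N)) {
      Fi <- sum(sapply(seq_len(N), function(j) a[j] * rbf_kernel(X[i, ], X[j, ], sigma))) + b
      if (Fi * y[i] <= 0) {            # misclassified (or on the boundary)
        a[i] <- a[i] + y[i]
        b <- b + y[i]
      }
    }
  }
  list(a = a, b = b)
}
## Toy data: +1 inside a circle, -1 outside (not linearly separable in 2D)
set.seed(1)
X <- matrix(runif(200, -1, 1), ncol = 2)
y <- ifelse(rowSums(X^2) < 0.5, 1, -1)
sigma <- 0.5
fit <- kernel_perceptron(X, y, sigma = sigma)
## Classify a new point Z by the sign of F(Z) = a1*K(Z,X1) + ... + aN*K(Z,XN) + b
F_of <- function(Z) sum(sapply(seq_len(nrow(X)),
                               function(j) fit$a[j] * rbf_kernel(Z, X[j, ], sigma))) + fit$b
sign(F_of(c(0, 0)))   # classify the origin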
References:
I borrowed some notation from
http://alex.smola.org/teaching/pune2007/pune_3.pdf
I've been running a machine learning algorithm, I have output in the form of Precision, Recall, and F-Measure.
I'd like to graph this data so I can get a clearer conception of how things are really going, but I don't really know how to do that. I suppose I can use Octave? I heard about it in that Andrew Ng course and I've already got it on my machine, but I don't really know how to use it to visualize data.
Does anyone with experience in this know how I might best proceed or some helpful resources on the best way to go about this?
0.011723329425556858 P 0.6000000238418579 R 0.010416666977107525 F1 0.02047781631341665
0.012895662368112544 P 0.6363636255264282 R 0.01215277798473835 F1 0.023850085569817648
0.01406799531066823 P 0.6666666865348816 R 0.013888888992369175 F1 0.027210884568890845
0.015240328253223915 P 0.6153846383094788 R 0.013888888992369175 F1 0.02716468612858015
0.016412661195779603 P 0.6428571343421936 R 0.015625 F1 0.03050847456668239
0.017584994138335287 P 0.6000000238418579 R 0.015625 F1 0.03045685282259509
0.01875732708089097 P 0.5625 R 0.015625 F1 0.030405405405405407
0.01992966002344666 P 0.529411792755127 R 0.015625 F1 0.030354131580674088
0.021101992966002344 P 0.5555555820465088 R 0.0173611119389534 F1 0.03367003527554599
0.022274325908558032 P 0.5263158082962036 R 0.0173611119389534 F1 0.03361344696816966
0.023446658851113716 P 0.5 R 0.0173611119389534 F1 0.033557048526295
0.0246189917936694 P 0.4761904776096344 R 0.0173611119389534 F1 0.03350083906570289
I suppose the first column is some threshold you varied between lines.
A precision-recall graph plots precision against recall, so we first retrieve those two columns from your data (suppose your data are saved in prf.data):
cat prf.data | awk '{print $3,$5}'
You will get just those two columns, which you can use to initialize a 2-D matrix in Octave:
data = [
0.6000000238418579 0.010416666977107525
0.6363636255264282 0.01215277798473835
0.6666666865348816 0.013888888992369175
0.6153846383094788 0.013888888992369175
0.6428571343421936 0.015625
0.6000000238418579 0.015625
0.5625 0.015625
0.529411792755127 0.015625
0.5555555820465088 0.0173611119389534
0.5263158082962036 0.0173611119389534
0.5 0.0173611119389534
0.4761904776096344 0.0173611119389534];
Then, in Octave, the commands below will plot each row as a point in the graph:
plot(data(:,2), data(:,1), 'x')
ylabel('precision')
xlabel('recall')
It looks like that, as the threshold increases, precision decreases while recall stays the same (for example, at thresholds 0.021, 0.022, 0.023, and 0.024).
CRF++ says it can:
"Can output marginal probabilities for all candidates" on its page: http://crfpp.sourceforge.net/
But what's the notation of the formula that's used to find these probabilities, in conditional random fields?
Someone told me it's not simply p(a|b), because conditional random fields use context from adjacent observations.
What exactly are these marginal probabilities?
The conditional probability is just p(y|x) where y is a sequence of labels and x is the associated observed sequence.
The expression for this probability is just the softmax function \exp(a_y) / \sum_{y'} \exp(a_{y'}).
For a CRF, a_y is a function of the label sequence: a_y = w \cdot \phi(x,y), where \phi(x,y) is a feature vector derived from a sequence and its labels.
This means that the sum in the denominator runs over the exponentially many possible label sequences \mathcal{Y}:
\sum_{y' \in \mathcal{Y}} \exp ( w \cdot \phi(x,y') )
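Putting those pieces together, the same conditional probability written out in one expression is:
p(y \mid x) = \frac{\exp\left( w \cdot \phi(x,y) \right)}{\sum_{y' \in \mathcal{Y}} \exp\left( w \cdot \phi(x,y') \right)}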