I am trying to use the community-contributed command frmttable in Stata to generate a table of summary statistics for date variables.
However, when I execute the command, the summary statistics are displayed as integers rather than dates. I would like them displayed in MDY format: %tdNN/DD/CCYY
The problem is shown below:
Step Dates
-------------------
Step Date
-------------------
Step 1 17,206
Step 2 17,241
Step 3 17,258
Step 4 17,619
Step 5 17,958
Step 6 18,401
Step 7 18,464
Step 8 18,976
Step 9 18,965
Step 10 19,243
Step 11 19,064
-------------------
I am not considering other table exporting commands since frmttable gives me the most flexibility. I am also trying to export the table into LaTeX.
Example data can be found below:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double Step_n float Date
2 17206
2 17234
3 17241
3 17339
4 17258
4 17626
5 17619
5 17619
5 18155
6 17958
6 19339
7 18401
7 18662
8 18464
8 19001
8.5 18976
8.5 19267
9 18965
9.5 19243
10 19064
10 20227
end
format %tdNN/DD/CCYY Date
The code I used is the following:
matrix m1 = J(11,1,.)
local i = 1
foreach s of numlist 2/8 8.5 9 9.5 10 {
quietly summarize Date if Step_n==`s'
matrix m1[`i',1]=r(min)
local i = `i' + 1
}
matrix rownames m1 = "Step 1" "Step 2" "Step 3" "Step 4" ///
"Step 5" "Step 6" "Step 7" "Step 8" "Step 9" "Step 10" "Step 11"
matrix list m1, format(%tdNN/DD/CCYY)
frmttable using m1.tex, statmat(m1) title("Step Dates") ///
sdec(0) ctitle("Step","Date") replace tex
The community-contributed command frmttable is used to produce tables for summary statistics, the format of which can be specified by the sfmt() option.
However, as its help file suggests, in its current version this does not support date formats:
"...fmtgrid has the form fmt[,fmt...] [\ fmt[,fmt...] ...]], where fmt is either e, f, fc, g, or gc..."
An attempt to run frmttable with such a format specified confirms this:
. frmttable, statmat(m1) sfmt(%tdNN/DD/CCYY)
sfmt contains elements other than "e","f","g","fc", and "gc"
r(198);
The community-contributed command esttab offers an out-of-the-box solution:
esttab matrix(m1, fmt(%tdNN/DD/CCYY)), nomtitles ///
collabel("Date") ///
title("Step Dates") ///
tex
\begin{table}[htbp]\centering
\caption{Step Dates}
\begin{tabular}{l*{1}{c}}
\hline\hline
& Date \\
\hline
Step 1 & 02/09/2007\\
Step 2 & 03/16/2007\\
Step 3 & 04/02/2007\\
Step 4 & 03/28/2008\\
Step 5 & 03/02/2009\\
Step 6 & 05/19/2010\\
Step 7 & 07/21/2010\\
Step 8 & 12/15/2011\\
Step 9 & 12/04/2011\\
Step 10 & 09/07/2012\\
Step 11 & 03/12/2012\\
\hline\hline
\end{tabular}
\end{table}
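For context, the raw integers in the matrix are Stata %td date values, i.e. day counts from 01jan1960. A quick Python sketch of the conversion esttab applies (the function name is mine):

```python
from datetime import date, timedelta

def stata_td_to_mdy(days):
    """Stata %td values count days elapsed since 01jan1960;
    render one as an MM/DD/CCYY string."""
    return (date(1960, 1, 1) + timedelta(days=days)).strftime("%m/%d/%Y")

print(stata_td_to_mdy(17206))  # 02/09/2007, matching Step 1 above
```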
Table 1:

Position  Team
1         MCI
2         LIV
3         MAN
4         CHE
5         LEI
6         AST
7         BOU
8         BRI
9         NEW
10        TOT
Table 2:

Position  Team
1         LIV
2         MAN
3         MCI
4         CHE
5         AST
6         LEI
7         BOU
8         TOT
9         BRI
10        NEW
The output I'm looking for is:
Positional difference = 10, i.e. the total of each team's absolute position change between the two tables. How can I do this in Excel/Google Sheets? The positional difference is always positive, whether a team moves up or down. Think of it as a league table.
Table 2 New (using a formula to find the positional difference):

Position  Team  Positional Difference
1         LIV   1
2         MAN   1
3         MCI   2
4         CHE   0
5         AST   1
6         LEI   1
7         BOU   0
8         TOT   2
9         BRI   1
10        NEW   1
Try this (assuming Table 1 is in columns A:B; the references to D2 and E2 imply Table 2 is in columns D:E):
=IFNA(ABS(INDEX(A:B,MATCH(E2,B:B,0),1)-D2),"-")
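The same lookup can be sketched outside the spreadsheet; here is a minimal Python version with the two example tables hard-coded (variable names are mine):

```python
# Positional difference between two league tables: a sketch of the
# spreadsheet logic above, using plain dicts instead of cell ranges.
table1 = {"MCI": 1, "LIV": 2, "MAN": 3, "CHE": 4, "LEI": 5,
          "AST": 6, "BOU": 7, "BRI": 8, "NEW": 9, "TOT": 10}
table2 = {"LIV": 1, "MAN": 2, "MCI": 3, "CHE": 4, "AST": 5,
          "LEI": 6, "BOU": 7, "TOT": 8, "BRI": 9, "NEW": 10}

# Per-team absolute movement; "-" when a team is missing (the IFNA case).
diffs = {team: abs(table1[team] - pos) if team in table1 else "-"
         for team, pos in table2.items()}

total = sum(d for d in diffs.values() if d != "-")
print(diffs["MCI"], total)  # MCI moved 2 places; total movement is 10
```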
I am trying to perform a random forest survival analysis according to the randomForestSRC vignette in R. I have a data frame containing 59 variables, 14 of which are numeric and the rest factors. Two of the numeric ones are TIME (days until death) and DIED (0/1, dead or not). I'm running into two problems:
trainrfsrc<- rfsrc(Surv(TIME, DIED) ~ .,
data = train, nsplit = 10, na.action = "na.impute")
runs fine, and printing trainrfsrc reports: Error rate: 17.07%. However, exploring the error rate with:
plot(gg_error(trainrfsrc))+ coord_cartesian(y = c(.09,.31))
returns:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
or:
a<-(gg_error(trainrfsrc))
a
    error ntree
 1     NA     1
 2     NA     2
 3     NA     3
 4     NA     4
 5     NA     5
 6     NA     6
 7     NA     7
 8     NA     8
 9     NA     9
10     NA    10
The error is NA for all 1000 trees. How come there's no error rate for each number of trees tried?
the second problem is when trying to explore the most important variables using VIMP such as:
plot(gg_vimp(trainrfsrc)) + theme(legend.position = c(.8,.2))+ labs(fill = "VIMP > 0")
it returns:
In gg_vimp.rfsrc(trainrfsrc) : rfsrc object does not contain VIMP information. Calculating...
Any ideas? Thanks
Setting err.block = 1 (or some integer between 1 and ntree) in the rfsrc() call should fix the problem of the error rate returning NA. You can check the rfsrc help file to read more about err.block.
I am trying to calculate points in a Formula 1 racing league. I'm having trouble with a 15-point bonus awarded when a constructor qualifies 1st and also finishes the race 1st. The issue is that two different drivers could do this. For example, in the data below HAM qualified 1st and ROS finished the race 1st; because they both drive for Mercedes, the 15 points need to be awarded to Mercedes. The data can't be moved around as it's imported using an API (not in the example), but a copy of the layout can be found here
Qualifying Race Driver Team
14 1 ROS mercedes
1 15 HAM mercedes
3 3 VET ferrari
8 4 RIC red_bull
6 5 MAS williams
19 6 GRO haas
10 7 HUL force_india
16 8 BOT williams
7 9 SAI toro_rosso
5 10 VES toro_rosso
13 11 PAL renault
Put this in I2 and copy down. See if that is how you want it:
=IF(AND(VLOOKUP(1, $A$2:$H$12, 8, FALSE)=VLOOKUP(1, $B$2:$H$12, 7, FALSE), VLOOKUP(1, $B$2:$H$12, 7, FALSE)=H2, MATCH(H2, H:H, 0)=ROW(H2)), 15, 0)
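In plain terms, the formula awards the bonus when the team that qualified 1st is also the team that finished the race 1st. That check can be sketched in Python with a few of the example rows (variable names are mine):

```python
# Award 15 bonus points when the team that qualified 1st also won the race,
# even if two different drivers did it.
# Rows: (qualifying_pos, race_pos, driver, team), taken from the example data.
rows = [(14, 1, "ROS", "mercedes"), (1, 15, "HAM", "mercedes"),
        (3, 3, "VET", "ferrari"), (8, 4, "RIC", "red_bull")]

quali_winner_team = next(team for q, r, d, team in rows if q == 1)
race_winner_team = next(team for q, r, d, team in rows if r == 1)

bonus = {team: 0 for _, _, _, team in rows}
if quali_winner_team == race_winner_team:
    bonus[quali_winner_team] = 15
print(bonus["mercedes"])  # 15: HAM qualified 1st, ROS won, both Mercedes
```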
I wish to make a formula that sums values using 2 criteria; an example is shown below:
A B C D E
1 1-Apr 2-Apr 3-Apr 4-Apr
2 aa 1 4 7 10
3 bb 2 5 8 11
4 cc 3 6 9 12
5
6 Criteria 1 bb
7 Range start 2-Apr-16
8 Range End 4-Apr-16
9 Total sum #VALUE!
Formulas I tried:
1. =SUMIF(A2:A4,C6,INDEX(B2:E4,0,MATCH(C7,B1:E1,0)))
   * only returns a single cell's value
2. =SUMIF(A2:A4,C6,INDEX(B2:E4,0,MATCH(">="&C7,B1:E1,0)))
   * shows an #N/A error
3. =SUMIFS(B2:E4,A2:A4,C6,B1:E1,">="&C7,B1:E1,"<="&C8)
   * shows a #VALUE! error
Can anyone help me with the formula?
I figured out the solution with step evaluation:
=SUMIF(B1:F1,">="&C7,INDEX(B2:F4,MATCH(C6,A2:A4,0),0)) -
SUMIF(B1:F1,">"&C8,INDEX(B2:F4,MATCH(C6,A2:A4,0),0))
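The trick here is to sum everything from the start date onward, then subtract everything after the end date. A Python sketch of the same idea, using the example row for bb (variable names are mine):

```python
from datetime import date

# The "bb" row from the example table, keyed by column date.
row = {date(2016, 4, 1): 2, date(2016, 4, 2): 5,
       date(2016, 4, 3): 8, date(2016, 4, 4): 11}

start, end = date(2016, 4, 2), date(2016, 4, 4)

# Mirrors SUMIF(...,">="&start) - SUMIF(...,">"&end):
# everything from start onward, minus everything strictly after end.
total = (sum(v for d, v in row.items() if d >= start)
         - sum(v for d, v in row.items() if d > end))
print(total)  # 5 + 8 + 11 = 24
```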
I have just entered the space of data mining, machine learning, and clustering. I have a specific problem and do not know which technique to use to solve it.
I want to perform clustering of observations (objects, or whatever) given data in the following format. All variables in each observation are numeric. My input data looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represents a row (an observation, or a 1D vector) and m the column (the variable's index within each vector). n could be a very large number, and 0 < m < 100. An important point is that the same value cannot appear twice within one row (in the 1st row, a value could appear only once).
So, I want to cluster the observations, putting rows in the same cluster based on the number of values they share.
If there are two rows like:
1
1 2 3 4 5
They should be clustered in the same cluster; if there is no match, then definitely not. Also, the number of rows in one cluster should not exceed 100.
Tough problem? Just for info, I didn't mention the time dimension, but let's skip that for now.
So, any directions from you guys?
Thanks and best regards,
JDK
It's hard to recommend anything since your problem is quite vague and we have no information on the data. Data mining (and in particular explorative techniques like clustering) is all about understanding the data, so we cannot provide the ultimate answer.
Two things for you to consider:
1. If the data indicates the presence of species or traits, Jaccard similarity (and other set-based metrics) is worth a try.
2. If absence is less informative, maybe you should be mining association rules, not clusters.
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks to get the best useless result!
Could your problem be treated as a bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about why and how you want to cluster these data. For example, say we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Before you give more information: going by your current description, I think you do not need a clustering algorithm but a structure of connected components. In the first round you process the dataset to build the connected-component information, and in a second round you check which connected component each row belongs to. Taking the example above, the first round:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (all point linked to the smallest point to
represent they are belong to the same cluster of the smallest point)
2 3 4 : 2 <- 4 (2 and 3 have already linked to 1 which is <= 2, so they do
not need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (in fact this change are not essential because we have
1 <- 2 <- 4, but change this can speed up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure to represent the connected components of points. The second round you can easily pick up one point in each row (the smallest one is the best) and trace its root in the forest. The rows which have the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of distinct points, and O(nm + nh) time, where h is the height of the forest structure, with h << m.
I am not sure if this is the result you want.
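The two-round procedure described above amounts to union-find over the values; here is a minimal Python sketch of it (the function and variable names are mine):

```python
def cluster_rows(rows):
    """Group rows that share at least one value, via union-find."""
    parent = {}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # link to the smaller root

    # First round: link every value in a row to the row's first value.
    for row in rows:
        for v in row[1:]:
            union(row[0], v)

    # Second round: a row's cluster is the root of any of its values.
    clusters = {}
    for row in rows:
        clusters.setdefault(find(row[0]), []).append(row)
    return clusters

# The example rows from the answer above.
rows = [[1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4],
        [3, 4, 6], [6, 7, 8], [9, 10], [9, 11], [10, 12, 13, 14]]
print(len(cluster_rows(rows)))  # 2 clusters: roots 1 and 9
```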