Bootstrap confidence intervals on the mean by group are NA

I'm trying to construct confidence intervals around each group mean for a plot I've made, but the bootstrap method hasn't worked. I'm sure I'm doing this incorrectly, but the best example I found online for estimating confidence intervals around the mean of each group was:
wet_pivot %>%
  select(n_mean, CYR) %>%
  group_by(CYR) %>%
  summarise(data = list(smean.cl.boot(cur_data(), conf.int = .95, B = 1000, na.rm = TRUE))) %>%
  tidyr::unnest_wider(data)
Result:
# A tibble: 13 x 4
CYR Mean Lower Upper
<dbl> <dbl> <dbl> <dbl>
1 2009 0.00697 NA NA
2 2010 0.000650 NA NA
3 2011 0.00288 NA NA
4 2012 0.0114 NA NA
5 2013 0.000536 NA NA
6 2014 0.00350 NA NA
7 2015 0.000483 NA NA
8 2016 0.00245 NA NA
9 2017 0.00292 NA NA
10 2018 0.00253 NA NA
11 2019 0.00196 NA NA
12 2020 0.00502 NA NA
13 2021 0.00132 NA NA
Am I making incorrect assumptions about my data with this method? Even if this worked, is it possible to manually add each confidence interval into a line plot using ggplot?
My data:
> head(wet_pivot)
WYR CYR Season N n_mean n_median sd se
1 2010 2009 WET 59 0.0069680693 0 0.030946706 0.0040289180
2 2011 2010 WET 63 0.0006497308 0 0.002489655 0.0003136671
3 2012 2011 WET 69 0.0028825655 0 0.010097383 0.0012155821
4 2013 2012 WET 70 0.0114108839 0 0.051577935 0.0061647423
5 2014 2013 WET 72 0.0005361741 0 0.003314688 0.0003906397
6 2015 2014 WET 71 0.0034958465 0 0.026606408 0.0031575998
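Edit: to make the second question concrete, here is a minimal sketch of what I have in mind. It assumes a hypothetical unsummarised data frame wet_raw with one numeric observation column n (since smean.cl.boot() from Hmisc wants a numeric vector, and each CYR in wet_pivot only has a single, already-summarised row), and then adds the intervals to a line plot with geom_errorbar():

library(dplyr)
library(tidyr)
library(Hmisc)
library(ggplot2)

# Hypothetical raw data: one row per observation, numeric column `n`, year column `CYR`.
boot_ci <- wet_raw %>%
  group_by(CYR) %>%
  summarise(ci = list(smean.cl.boot(n, conf.int = .95, B = 1000, na.rm = TRUE))) %>%
  unnest_wider(ci)   # gives columns Mean, Lower, Upper

# Manually adding each confidence interval to a line plot:
ggplot(boot_ci, aes(x = CYR, y = Mean)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2)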

Related

Circuits, hazards, Karnaugh diagrams: Can a Karnaugh-Veitch diagram consisting of only 2 variables x and y even contain a hazard?

I don't completely understand this, but to me it seems like there can't be a problematic/hazardous path.
It's about hazards that can occur in a half-adder circuit built from inverters, XORs and ANDs. I can't manage to come up with the structural term and diagram.
I would really appreciate some help.
To my understanding there can't be a data hazard, but a structural hazard can obviously occur due to the different gates used. However, I can't get a structural KV diagram out of it, because there are only the 2 variables x and y.
All 16 cases, with the POS hazards explained, are shown below. Each map is written as its b = 0 row followed by its b = 1 row, with the cells in the order a = 0, a = 1:

Case 0:  map 00 / 00  ->  0
Case 1:  map 10 / 00  ->  not a . not b
Case 2:  map 01 / 00  ->  a . not b
Case 3:  map 11 / 00  ->  not b
Case 4:  map 00 / 10  ->  not a . b
Case 5:  map 10 / 10  ->  not a
Case 6:  map 01 / 10  ->  a . not b + not a . b
         [(a, b)]: (1, 0) = 1 ==> (0, 1) = 1, but (0, 0) = 0 or (1, 1) = 0 could be hit during the transition; SOP resolves this: not (a.b)
Case 7:  map 11 / 10  ->  not b + not a
         [(a, b)]: (1, 0) = 1 ==> (0, 1) = 1, but (1, 1) = 0 could be hit during the transition; SOP resolves this: not (a.b)
Case 8:  map 00 / 01  ->  a . b
Case 9:  map 10 / 01  ->  not a . not b + a . b (similar to case 6 above)
Case 10: map 01 / 01  ->  a
Case 11: map 11 / 01  ->  not b + a (similar to case 7 above)
Case 12: map 00 / 11  ->  b
Case 13: map 10 / 11  ->  not a + b (similar to case 7 above)
Case 14: map 01 / 11  ->  a + b (similar to case 7 above)
Case 15: map 11 / 11  ->  1

Does the XGBoost regressor handle missing timesteps?

I have a dataframe of daily item sales: the goal is to forecast future sales for good warehouse supply. I'm using XGBoost as a regressor.
date                 qta      prezzo  year  day  dayofyear  month  week  dayofweek  festivo
2014-01-02 00:00:00  6484.8   1       2014  2    2          1      1     3          1
2014-01-03 00:00:00  5300     1       2014  3    3          1      1     4          1
2014-01-04 00:00:00  2614.9   1.1     2014  4    4          1      1     5          1
2014-01-07 00:00:00  114.3    1.1     2014  7    7          1      2     1          0
2014-01-09 00:00:00  11490    1       2014  9    9          1      2     3          0
The date is also the index of my dataframe. Qta is the label (the dependent variable) and all the others are the features.
As you can see it's a daily sampling, but some days are missing (e.g. the 5th, 6th and 8th).
Could that be a problem when fitting and predicting future days?
Am I supposed to fill the missing days with qta = 0?

Arrange downloaded data in a more useful way in Google Sheets

We currently have fixed report data that we can only manipulate after download; to simplify, it looks like this:
Raw report data extracted to Google Sheets:
a b c
1 Start Date Time Adhering to Schedule (Hours) Time Not Adhering to Schedule (Hours)
2 Employee: A Supervisor: X
3 5/4/2022 7.65 1.35
4 5/5/2022 8.12 0.88
5 5/6/2022 6.95 2.05
6 5/9/2022 8.7 0.3
7 5/10/2022 7.45 1.55
8 5/11/2022 8.63 0.37
9 5/12/2022 8.08 0.92
10 5/13/2022 6.13 0.13
11 Totals: 61.71 7.55
12 Employee: B Supervisor: X
13 5/1/2022 3.8 0.27
14 5/2/2022 6.72 2.28
15 5/3/2022 6.1 2.9
16 5/4/2022 8.43 0.57
17 5/5/2022 5.85 0.53
18 5/10/2022 6.13 2.87
19 5/11/2022 0 1.5
20 5/12/2022 2 1.5
21 5/13/2022 1.75 1.75
22 Totals: 40.78 14.17
I would like some help in constructing a new sheet via formulas so that it rearranges the raw data as follows:
desired output
a b c d e
1 EMPLOYEE SUPERVISOR Start Date Time Adhering to Schedule (Hours) Time Not Adhering to Schedule (Hours)
2 A X 04/05/22 7.65 1.35
3 A X 05/05/22 8.12 0.88
4 A X 06/05/22 6.95 2.05
5 A X 09/05/22 8.70 0.30
6 A X 10/05/22 7.45 1.55
7 A X 11/05/22 8.63 0.37
8 A X 12/05/22 8.08 0.92
9 A X 13/05/22 6.13 0.13
10 B X 01/05/22 3.80 0.27
11 B X 02/05/22 6.72 2.28
12 B X 03/05/22 6.10 2.90
13 B X 04/05/22 8.43 0.57
14 B X 05/05/22 5.85 0.53
15 B X 10/05/22 6.13 2.87
16 B X 11/05/22 0.00 1.50
17 B X 12/05/22 2.00 1.50
18 B X 13/05/22 1.75 1.75
It probably needs some combination of QUERY(), ARRAYFORMULA(), TRANSPOSE() and/or INDEX(), but I can't quite figure it out. I need some help to get started on the right track. The dates and data between employees are dynamic, so the formula for the desired result needs to adjust to that as well.
Thanks!
Edit: adding a sample sheet for reference :) https://docs.google.com/spreadsheets/d/1m_FCGcnXvnEiMZ8X4K1eEsMljORWV4V1Yq_81vFnx4Y/edit?usp=sharing
Global solution
In E1:
={ArrayFormula(if(A1:A="Totals:",,{
substitute(lookup(row(A1:A),row(A1:A)/if(ISNUMBER(A1:A),0,1),A1:A),"Employee: ",""),
substitute(lookup(row(A1:A),row(A1:A)/if(ISNUMBER(A1:A),0,1),C1:C),"Supervisor: ","")
})),Arrayformula(if(ISNUMBER(A1:A),{A1:A,B1:B,C1:C},))}
Or, in 3 steps (3 ArrayFormulas):
Try this in H1:
=arrayformula(if(left(A1:A,6)="Totals",,if(left(A1:A,8)="Employee",{B1:B,D1:D,E1:E,E1:E,E1:E},{E1:E,E1:E,A1:A,B1:B,C1:C})))
Then, back in F1, to complete all rows with employee and supervisor:
=ArrayFormula({lookup(row(H:H),row(H:H)/if(H:H<>"",1,0),H:H),lookup(row(I:I),row(I:I)/if(I:I<>"",1,0),I:I)})
Finally, if you want to reduce the output to just the needed columns, in M1:
=query(F:L,"select F,G,J,K,L where J is not null",0)

RandomForestSRC error and vimp

I am trying to perform a random forest survival analysis according to the randomForestSRC vignette in R. I have a data frame containing 59 variables, 14 of which are numeric and the rest factors. Two of the numeric ones are TIME (days till death) and DIED (0/1, dead or not). I'm running into 2 problems:
trainrfsrc<- rfsrc(Surv(TIME, DIED) ~ .,
data = train, nsplit = 10, na.action = "na.impute")
trainrfsrc gives: Error rate: 17.07%
That works fine; however, exploring the error rate with, for example:
plot(gg_error(trainrfsrc))+ coord_cartesian(y = c(.09,.31))
returns:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
or:
a<-(gg_error(trainrfsrc))
a
   error ntree
1     NA     1
2     NA     2
3     NA     3
4     NA     4
5     NA     5
6     NA     6
7     NA     7
8     NA     8
9     NA     9
10    NA    10
It is NA for all 1000 trees. How come there's no error rate for each number of trees tried?
The second problem is when trying to explore the most important variables using VIMP, for example:
plot(gg_vimp(trainrfsrc)) + theme(legend.position = c(.8,.2))+ labs(fill = "VIMP > 0")
it returns:
In gg_vimp.rfsrc(trainrfsrc) : rfsrc object does not contain VIMP information. Calculating...
Any ideas? Thanks
Setting err.block = 1 (or some integer between 1 and ntree) should fix the problem of the error being returned as NA. You can check the help file for rfsrc to read more about err.block.
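Not a definitive fix, but a minimal sketch of how the two points above might fit together: request the per-tree error and the variable importance up front so that gg_error() and gg_vimp() can find them. The argument names follow the answer above and the question's own call; depending on the randomForestSRC version, the per-tree error argument may be called block.size instead of err.block, so check ?rfsrc:

library(randomForestSRC)
library(ggRandomForests)

# Sketch only: err.block follows the answer above; in newer randomForestSRC
# releases the equivalent argument is block.size (see ?rfsrc).
trainrfsrc <- rfsrc(Surv(TIME, DIED) ~ .,
                    data = train,
                    nsplit = 10,
                    na.action = "na.impute",
                    err.block = 1,       # compute the cumulative error for every tree
                    importance = TRUE)   # store VIMP so gg_vimp() does not have to recalculate it

plot(gg_error(trainrfsrc))   # per-tree error should no longer be all NA
plot(gg_vimp(trainrfsrc))    # no "does not contain VIMP information" message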

What to do if response (or label) columns are in another data frame?

I'm a newbie in machine learning, so I need your advice.
Imagine we have two data sets (df1 and df2).
The first data set includes about 5000 observations and some features; to simplify:
name age company degree_of_skill average_working_time alma_mater
1 John 39 A 89 38 Harvard
2 Steve 35 B 56 46 UCB
3 Ivan 27 C 88 42 MIT
4 Jack 26 A 87 37 MIT
5 Oliver 23 B 76 36 MIT
6 Daniel 45 C 79 39 Harvard
7 James 34 A 60 40 MIT
8 Thomas 28 B 89 39 Stanford
9 Charlie 29 C 83 43 Oxford
The learning problem: predict the productivity of the companies in the second data set (df2) for the next period of time (June 2016), based on data from the first data set (df1).
df2:
company productivity date
1 A 1240 april-2016
2 B 1389 april-2016
3 C 1388 april-2016
4 A 1350 may-2016
5 B 1647 may-2016
6 C 1272 may-2016
So as we can see, both data sets include the feature "company", but I don't understand how I can create a link between these two columns. What should I do with the two data sets to solve the learning problem? Is it possible?
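For example, would something along these lines be a sensible way to create that link? This is only a sketch in R: aggregate df1 to one row per company and join those aggregates onto df2 by the shared company column (the summary columns here are placeholders, not a recommendation):

library(dplyr)

# Sketch only: collapse the employee-level features of df1 to one row per
# company, then attach them to every (company, date) row of df2.
company_features <- df1 %>%
  group_by(company) %>%
  summarise(mean_age = mean(age),
            mean_skill = mean(degree_of_skill),
            mean_hours = mean(average_working_time),
            n_employees = n())

training_data <- df2 %>%
  left_join(company_features, by = "company")

# productivity in training_data is the label; the joined company-level
# columns (plus any date-derived features) are candidate predictors.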
