SAS Proc UCM ODS tables output for multiple iterations - stored-procedures

I am trying to run multiple iterations (5,400) of various combinations of predictors in PROC UCM. The SAS code for all combinations of variables has been generated sequentially by an external program (C# etc.).
I think this could be done in SAS as well, but that's a topic for another time.
I want to compile the parameter estimates from all iterations into one table and all of the "Smoothed estimate of sum of all components except the irregular component" output into another, and then apply some conditions to them.
Currently I am writing all the output to CSV, which isn't getting me anywhere. I know I can use ODS OUTPUT and specify "ParameterEstimates" and "SmoothedAllExceptIrreg", but I can't figure out how to do that for all 5,400 iterations.
My sample code for 4 iterations is below (1347, 1348, 1349 and 1350 are the iteration numbers generated sequentially from the combinations):
/*Level with variance,Season without variance,Slope without variance*/
ods csv file = "D:\SAS\Iteration_Hair_Care_Volume_1347.csv";
proc ucm data = project.Compiled printall;
irregular;
level;
forecast lead = 8 alpha = 0.05 outfor = project.Compiled_Output_Hair;
model HC = Gr_CPI gdp;
season length = 4 var = 0 noest;
slope var = 0 noest;
where count > 36;
run;
quit;
ods csv close;
/*Level with variance,Season without variance,Slope with Variance*/
ods csv file = "D:\SAS\Iteration_Hair_Care_Volume_1348.csv";
proc ucm data = project.Compiled printall;
irregular;
level;
forecast lead = 8 alpha = 0.05 outfor = project.Compiled_Output_Hair;
model HC = Gr_CPI Earning_Avg_Lag;
season length = 4 var = 0 noest;
slope;
where count > 36;
run;
quit;
ods csv close;
/*Level with variance,Season with variance,Slope without Variance*/
ods csv file = "D:\SAS\Iteration_Hair_Care_Volume_1349.csv";
proc ucm data = project.Compiled printall;
irregular;
level;
forecast lead = 8 alpha = 0.05 outfor = project.Compiled_Output_Hair;
model HC = Gr_CPI pdi;
season length = 4;
slope var = 0 noest;
where count > 36;
run;
quit;
ods csv close;
/*Level with variance,Season with variance,Slope with Variance*/
ods csv file = "D:\SAS\Iteration_Hair_Care_Volume_1350.csv";
proc ucm data = project.Compiled printall;
irregular;
level;
forecast lead = 8 alpha = 0.05 outfor = project.Compiled_Output_Hair;
model HC = Gr_CPI IIP;
season length = 4;
slope;
where count > 36;
run;
quit;
ods csv close;
I can append iteration numbers to the OUTFOR dataset, but I don't know what good comes of that. How do I automate this? Kindly help me with this.
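For reference, a minimal sketch of the kind of automation being asked for, assuming the 5,400 predictor combinations are kept in a dataset (the COMBOS dataset with columns ITER and VARLIST, and the macro name RUN_UCM, are hypothetical names, and the code is untested): wrap the PROC UCM step in a macro that captures the two ODS tables, tags them with the iteration number, and appends them to two cumulative datasets.
/* Sketch only: COMBOS (columns ITER, VARLIST) and the macro name are hypothetical --
   adapt to however the 5400 combinations are actually stored. */
%macro run_ucm(iter, varlist);
ods output ParameterEstimates = _pe SmoothedAllExceptIrreg = _sm;
proc ucm data = project.Compiled printall;
irregular;
level;
forecast lead = 8 alpha = 0.05;
model HC = &varlist;
season length = 4 var = 0 noest;
slope var = 0 noest;
where count > 36;
run;
/* tag each output table with the iteration number and accumulate */
data _pe; set _pe; iteration = &iter; run;
data _sm; set _sm; iteration = &iter; run;
proc append base = project.All_ParmEst data = _pe force; run;
proc append base = project.All_Smoothed data = _sm force; run;
%mend run_ucm;
/* drive the macro once per combination */
data _null_;
set combos;
call execute(cats('%nrstr(%run_ucm)(', iter, ',', varlist, ')'));
run;
In this sketch only the regressor list varies; the season/slope variance settings would need to be parameterised in the same way, and the OUTFOR= forecasts could be tagged and appended per iteration just like the two ODS tables.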

Related

Octave script falling into error when processing data with sampling frequency above 50kS/s

I'm working with an Octave script to process data files with high sample rates (up to 200kS/s collected over 3 minutes). The code runs into issues when processing any files with a sample rate above 50kS/s, regardless of size or number of samples, but functions correctly otherwise.
The error I receive when attempting to run the code with files above 50kS/s comes from the hist function:
error: x(0): subscripts must be either integers 1 to (2^63)-1 or logicals
error:
I have narrowed the cause to the following section of code (note that FS is the detected sampling frequency):
FILTER_ORDER = 1;
FILTER_CUTOFF = 1 / (2 * pi * 300e-3);
[b_lp, a_lp] = butter(FILTER_ORDER, FILTER_CUTOFF / (FS / 2), 'low');
%
s = SCALING_FACTOR * filter(b_lp, a_lp, u_q) ;
P = s ;
%
tau = 20;
transient = tau * FS; % index after the transient
Pmax = max(P(transient:end));
%
s = s(transient:end);
%
NUMOF_CLASSES = 10000; % number of bins used for the histogram
[bin_cnt, cpf.magnitude] = hist(s, NUMOF_CLASSES); % sorts data into the number of bins specified
I can try to provide more information if required; I'm not very familiar with Octave.

Time series simulation (Monte Carlo) code

I'm trying to build a Monte Carlo simulation with time series, and I can't work out what I'm doing wrong given my limited knowledge of Stata.
With my ARMA model I'm able to create a time series with 300 observations.
My idea is to repeat this process 1,000 times, and in each repetition save the mean and variance of the 300 observations in a matrix.
This is my ARMA(2,2) model (as implemented in the code below):
y_t = 0.7*y_(t-1) - 0.1*y_(t-2) + e_t - 0.6*e_(t-1) + 0.08*e_(t-2), with e_t ~ N(0,1)
This is my code:
clear all
set more off
set matsize 1000
matrix simulaciones =J(1000,2,0) *To save every simulation of every time series generated
matrix serie = J(300,3,0) *To save each time series
set obs 300 *For the 300 observations in every time series
gen t = _n
tsset t
g y1=0
forvalue j = 1(1)1000{
* Creating a time series
forvalues i = 1(1)300 {
gen e = rnormal(0,1)
replace y1=0 if t==1
replace y1 = 0.7*L1.y1 + e - 0.6*L1.e if t == 2
replace y1 = 0.7*L1.y1 - 0.1*L2.y1 + e - 0.6*L1.e + 0.08*L2.e if t > 2
matrix serie[`i',3] = y1
drop e y1
}
svmat serie
matrix simulaciones[`j',1] = mean(y1)
matrix simulaciones[`j',2] = var(y1)
}
I have no idea how to proceed, and any idea or recommendation is more than welcome.
Thanks a lot for your help and time.

Hide p_value and put stars on significant ORs with gtsummary

I'm using the gtsummary package.
I need to merge several univariate logistic regressions, and in order to have a good presentation I want to hide the p_value and bold or put a star on the significant ORs (p < 0.05).
Can anyone help me?
Maybe it's easier to use another presentation type like kable or huxtable, I don't know?
Thank you for your help.
Have a nice day
There is a function called add_significance_stars() that hides the p-value and adds stars to the estimate indicating various levels of statistical significance. I've also added code to bold the estimate if significant with modify_table_styling().
library(gtsummary)
#> #BlackLivesMatter
packageVersion("gtsummary")
#> [1] '1.4.0'
tbl <-
  trial %>%
  select(death, age, grade) %>%
  tbl_uvregression(
    y = death,
    method = glm,
    method.args = list(family = binomial),
    exponentiate = TRUE
  ) %>%
  # add significance stars to significant estimates
  add_significance_stars() %>%
  # additionally bold significant estimates
  modify_table_styling(
    columns = estimate,
    rows = p.value < 0.05,
    text_format = "bold"
  )
Created on 2021-04-14 by the reprex package (v2.0.0)
Here's a quick huxtable version:
l1 <- glm(I(cyl==8) ~ gear, data = mtcars, family = binomial)
l2 <- glm(I(cyl==8) ~ carb, data = mtcars, family = binomial)
huxtable::huxreg(l1, l2, statistics = "nobs", bold_signif = 0.05)
────────────────────────────────────────────────────
                     (1)          (2)
             ───────────────────────────────────
(Intercept)      5.999 *     -1.880 *
                (2.465)      (0.902)
gear            -1.736 *
                (0.693)
carb                          0.579 *
                             (0.293)
             ───────────────────────────────────
nobs                 32           32
────────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, model1, model2
It doesn't show it here, but the significant coefficients are bold on screen (and in any other kind of output).

Is there a faster way to generate the required output than using a one-to-many join in Proc SQL?

I require an output that shows the total number of hours worked in a rolling 24-hour window. The data is currently stored such that each row is one hourly slot per person (for example, 7-8am on Jan 2nd), with how much they worked in that hour stored as "Hour". What I need to create is another field that is the sum of "Hour" over the most recent 24 hourly slots (inclusive) for each row. So for the 7-8am example above I would want the sum of "Hour" across the 24 rows: Jan 1st 8-9am, Jan 1st 9-10am... Jan 2nd 6-7am, Jan 2nd 7-8am.
Rinse and repeat for each hourly slot.
There are 6000 people, and we have 6 months of data, which means the table has 6000 * 183 days * 24 hours = 26.3m rows.
I am currently doing this using the code below, which works on a sample of 50 people very easily but grinds to a halt when I try it on the full table, somewhat understandably.
Does anyone have any other ideas? All date/time variables are in datetime format.
proc sql;
create table want as
select x.*
, case when Hours_Wrkd_In_Window > 16 then 1 else 0 end as Correct
from (
select a.ID
, a.Start_DTTM
, a.End_DTTM
, sum(b.hours) as Hours_Wrkd_In_Window
from have a
left join have b
on a.ID = b.ID
and b.start_dttm > a.start_dttm - (24 * 60 * 60)
and b.start_dttm <= a.start_dttm
where datepart(a.Start_dttm) >= &report_start_date.
and datepart(a.Start_dttm) < &report_end_date.
group by ID
, a.Start_DTTM
, a.End_DTTM
) x
order by x.ID
, x.Start_DTTM
;quit;
The most performant DATA step solution most likely involves a ring-array to track the 1hr time slots and hours worked within. The ring will allow a rolling aggregate (sum and count) to be computed based on what goes into and out of the ring.
If you have a wide SAS license, look into the procedures in SAS/ETS (Econometrics and Time Series). Proc EXPAND might have some rolling aggregate capability.
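For example, a rolling 24-slot sum with Proc EXPAND might look roughly like this (a sketch only, assuming SAS/ETS is licensed and each ID has complete, sorted hourly slots; the output dataset name is arbitrary):
proc expand data=have out=want_expand method=none;
by id;
id start_dttm;
convert hours = hours_rolling_24 / transformout=(movsum 24);
run;
MOVSUM 24 is a backward moving sum over the current slot and the 23 before it, which matches the 24-hour window when every hourly slot is present.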
This sample DATA step code took <10s (WORK folder on SSD) to run on simulated data for 6k people with 6 months of complete coverage of 1-hour time slots.
data have(keep=id start_dt end_dt hours);
do id = 1 to 6000;
do start_dt
= intnx('dtmonth', datetime(), -12)
to intnx('dtmonth', datetime(), -6)
by dhms(0,1,0,0)
;
end_dt = start_dt + dhms(0,1,0,0);
hours = 0.25 * floor (5 * ranuni(123)); * 0, 1/4, 1/2, 3/4 or 1 hour;
output;
end;
end;
format hours 5.2;
run;
/* %let log= ; options obs=50 linesize=200; * submit this (instead of next) if you want to log the logic; */
%let log=*; options obs=max;
data want2(keep=id start_dt end_dt hours hours_rolling_sum hours_rolling_cnt hours_out_:);
array dt_ring(24) _temporary_;
array hr_ring(24) _temporary_;
call missing (of dt_ring(*));
call missing (of hr_ring(*));
if 0 then set have; * prep pdv column order;
hours_rolling_sum = 0;
hours_rolling_cnt = 0;
label hours_rolling_sum = 'Hours worked in prior 24 hours';
index = 0;
do until (last.id);
set have;
by id start_dt;
index + 1;
if index > 24 then index = 1;
hours_out_sum = 0;
hours_out_cnt = 0;
do clear = 1 by 1 until (clear=0);
if sum (dt_ring(index), 0) = 0 then do;
* index is first go through ring array, or hit a zeroed slot;
&log putlog 'NOTE: ' index= 'clear for empty ring item. ';
clear = 0;
end;
else
if start_dt - dt_ring(index) >= %sysfunc(dhms(0,24,0,0)) then do;
&log putlog / 'NOTE: ' index= 'reducting and zeroing.' /;
hours_out_sum + hr_ring(index);
hours_out_cnt + 1;
hours_rolling_sum = hours_rolling_sum - hr_ring(index);
hours_rolling_cnt = hours_rolling_cnt - 1;
dt_ring(index) = 0;
hr_ring(index) = 0;
* advance item to next item, that might also be more than 24 hours ago;
index = index + 1;
if index > 24 then index = 1;
end;
else do;
&log putlog / 'NOTE: ' index= 'back off !' /;
* index was advanced to an item within 24 hours, back off one;
index = index - 1;
if index < 1 then index = 24;
clear = 0;
end;
end; /* do clear */
dt_ring(index) = start_dt;
hr_ring(index) = hours;
hours_rolling_sum + hours;
hours_rolling_cnt + 1;
&log putlog 'NOTE: ' index= 'overlaying and aggregating.' / 'NOTE: ' start_dt= hours= hours_rolling_sum= hours_rolling_cnt=;
output;
end; /* do until */
format hours_rolling_sum 5.2 hours_rolling_cnt 2.;
format hours_out_sum 5.2 hours_out_cnt 2.;
run;
options obs=max;
When reviewing the results you should notice that the delta for hours_rolling_sum is +(hours in slot) - (hours_out_sum, i.e. the hours removed from the ring).
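If you want to check that relationship programmatically, one quick sketch against the want2 dataset above (the CHECK dataset name is just illustrative):
data check;
set want2;
by id;
delta = hours_rolling_sum - lag(hours_rolling_sum);
if first.id then delta = .;
/* the change in the rolling sum should equal hours added minus hours dropped */
if not first.id and delta ne (hours - hours_out_sum) then
putlog 'WARNING: unexpected delta at ' id= start_dt=;
run;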
If you must use SQL, I would suggest following @jspascal's advice and indexing the table, but rearranging the query to left join the original data to an inner-joined subselect (so that SQL will do an index-assisted hash join on the IDs). For the same small number of people it should be faster than the original query, but it will still be too slow for doing all 6K.
proc sql;
create index id on have (id);
create index id_slot on have (id, start_dt);
quit;
proc sql _method;
reset inobs=50; * limit data so you can see the _method;
create table want as
select
have.*
, case
when ROLLING.HOURS_WORKED_24_HOUR_PRIOR > 16
then 1
else 0
end as REVIEW_TIME_CLOCKING_FLAG
from
have
left join
(
select
EACH_SLOT.id
, EACH_SLOT.start_dt
, count(*) as SLOT_COUNT_24_HOUR_PRIOR
, sum(PRIOR_SLOT.hours) as HOURS_WORKED_24_HOUR_PRIOR
from
have as EACH_SLOT
join
have as PRIOR_SLOT
on
EACH_SLOT.ID = PRIOR_SLOT.ID
and EACH_SLOT.start_dt - PRIOR_SLOT.start_dt between 0 and %sysfunc(dhms(0,24,0,0))-0.1
group by
EACH_SLOT.id, EACH_SLOT.start_dt
) as ROLLING
on
have.ID = ROLLING.ID
and have.start_dt = ROLLING.start_dt
order by
id, start_dt
;
%put NOTE: SQLOOPS = &SQLOOPS;
quit;
The inner join is pyramid-like and still involves a lot of internal looping.
A compound index on the columns being accessed in the joined table - id + start_dttm + hours - would be useful if there isn't one already.
Using msglevel=i will print some diagnostics about how the query is executed. It may give some additional hints.
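For example (a sketch using the column names from the question; the index name is arbitrary):
options msglevel=i; /* log notes about index usage and join methods */
proc sql;
create index id_dttm_hours on have (id, start_dttm, hours);
quit;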

shape of input to calculate information gain

I want to calculate the information gain on the 20_newsgroups data set.
I am using the code from here (I have also put a copy of the code at the bottom of the question).
As you can see, the input to the algorithm is X, y.
My confusion is that X is going to be a matrix with documents as rows and features as columns (for 20_newsgroups it is 11314 x 1000 in the case where I only consider 1000 features).
But according to the concept of information gain, it should calculate the information gain for each feature.
(So I was expecting the code to loop through each feature, meaning the input to the function would be a matrix where the rows are features and the columns are classes.)
But here X does not stand for features, it stands for documents, and I cannot see the part of the code that takes care of this! (I mean considering each document and then going through each feature of that document; i.e. looping through the rows while at the same time looping through the columns, since the features are stored in the columns.)
I have read this and this and many similar questions, but they are not clear in terms of the input matrix shape.
This is the code for reading 20_newsgroups:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroup_train = fetch_20newsgroups(subset='train')
X, y = newsgroup_train.data, newsgroup_train.target
cv = CountVectorizer(max_df=0.99, min_df=0.001, max_features=1000, stop_words='english', lowercase=True, analyzer='word')
X_vec = cv.fit_transform(X)
X_vec.shape is (11314, 1000), where 11314 is the number of documents rather than the number of features in the 20_newsgroups data set. I am wondering whether I am calculating information gain in an incorrect way.
This is the code for Information gain:
import numpy as np

def information_gain(X, y):
    def _calIg():
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                 + ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    # class distribution and entropy before any split
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    # nz[0]: feature index, nz[1]: document index of each non-zero entry of X
    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            for notappear in range(pre + 1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)

    return np.asarray(information_gain)
Well, after going through the code in detail, I learned more about X.T.nonzero().
Actually it is correct that information gain needs to loop through features.
It is also correct that the matrix scikit-learn gives us here is documents x features.
But:
In the code, X.T.nonzero() is used, which returns the indices of all the non-zero values as a pair of arrays, and the next line then loops over the length of that array with range(0, len(X.T.nonzero()[0])).
Overall, X.T.nonzero()[0] is returning the feature index of every non-zero entry to us :)
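As a small illustration of that (toy data, not the 20_newsgroups matrix): for a documents x features matrix, X.T.nonzero() returns one array of feature indices and one array of document indices, with one pair per non-zero entry.
import numpy as np
from scipy.sparse import csr_matrix

# 3 documents x 4 features (hypothetical toy counts)
X = csr_matrix(np.array([[1, 0, 2, 0],
                         [0, 3, 0, 0],
                         [4, 0, 0, 5]]))

rows, cols = X.T.nonzero()   # X.T is features x documents
for f, d in zip(rows, cols):
    print(f"feature {f} is non-zero in document {d}")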
