I have a table:
ID    No. of IA   Timeline
001   1           after
001   1           after
001   1           after
002   1           after
002   1           after
003   1           after
003   1           after
003   1           after
003   1           after
004   0           after
005   1           after
005   1           after
When I place it in Tableau I get:
What I want to do is use AGG(Total Cou... and make a histogram.
But since I have an aggregated value, I'm unable to do that, and I couldn't find a solution even on the official Tableau forums.
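To make the intent concrete, here is the same aggregation sketched in pandas; this is one reading of the truncated AGG(Total Cou... field (a per-ID total of "No. of IA"), and the dataframe construction just mirrors the table above:

import pandas as pd

# The table from the question; "IA" stands in for the "No. of IA" column.
df = pd.DataFrame({
    "ID": ["001"] * 3 + ["002"] * 2 + ["003"] * 4 + ["004"] + ["005"] * 2,
    "IA": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
})

# Per-ID total -- the value Tableau displays as an AGG(...) measure.
totals = df.groupby("ID")["IA"].sum()

# The histogram I'm after: how many IDs share each total.
print(totals.value_counts().sort_index())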
I have a dataframe of daily item sales; the goal is to forecast future sales for good warehouse supply. I'm using XGBoost as a regressor.
date                 qta      prezzo  year  day  dayofyear  month  week  dayofweek  festivo
2014-01-02 00:00:00  6484.8   1       2014  2    2          1      1     3          1
2014-01-03 00:00:00  5300     1       2014  3    3          1      1     4          1
2014-01-04 00:00:00  2614.9   1.1     2014  4    4          1      1     5          1
2014-01-07 00:00:00  114.3    1.1     2014  7    7          1      2     1          0
2014-01-09 00:00:00  11490    1       2014  9    9          1      2     3          0
The date is also the index of my dataframe. qta is the label (the dependent variable) and all the others are features.
As you can see it's a daily sampling, but some days are missing (e.g. the 5th, 6th, and 8th).
Could this be a problem when fitting the model and predicting future days?
Am I supposed to fill the missing days with qta = 0?
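If the missing days really are zero-sale days (rather than lost records), one way to fill them is to reindex to a complete daily calendar; a minimal pandas sketch, assuming the dataframe is named df:

import pandas as pd

# Rebuild a complete daily index over the observed range.
full_index = pd.date_range(df.index.min(), df.index.max(), freq="D")
df = df.reindex(full_index)

# Zero sales on the inserted days (only valid if "missing" means "no sales").
df["qta"] = df["qta"].fillna(0)
# Carrying the last known price forward is an assumption, not a given.
df["prezzo"] = df["prezzo"].ffill()

# Recompute the calendar features for the new rows.
df["year"] = df.index.year
df["day"] = df.index.day
df["dayofyear"] = df.index.dayofyear
df["month"] = df.index.month
df["week"] = df.index.isocalendar().week.astype(int)
df["dayofweek"] = df.index.dayofweek
# festivo (holiday flag) has to come from a holiday calendar; it cannot be inferred here.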
For the purpose of an event study, I would like to create lagged and leading variables in my data set. Unfortunately, my data is not in a balanced panel format.
My data set looks like this:
clear
input id year month binary str7 implement
28845421 2007 3 0 2008-1
29118744 2018 10 1 2012-6
29118744 2016 7 1 2016-7
29183010 2019 3 1 2010-1
29320027 2013 3 0 2015-2
end
. list
+---------------------------------------------+
| id year month binary implem~t |
|---------------------------------------------|
1. | 2.88e+07 2007 3 0 2008-1 |
2. | 2.91e+07 2018 10 1 2012-6 |
3. | 2.91e+07 2016 7 1 2016-7 |
4. | 2.92e+07 2019 3 1 2010-1 |
5. | 2.93e+07 2013 3 0 2015-2 |
+---------------------------------------------+
The variable binary equals 1 once the observation's year and month combination has reached the implement date. Each observation is identified by id.
The goal is to create lagged and leading versions of binary: binary-5, binary-4, ..., binary+1, binary+2, and so on. In other words, I want to shift implement by increments/decrements of n years to create the new binary variables.
How can I create such variables in Stata?
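For reference, here is one reading of that construction, sketched in Python/pandas rather than Stata; it assumes binary±n should equal 1 once the observation date has reached implement shifted by n years, and every name except the question's variables is hypothetical:

import pandas as pd

df = pd.DataFrame({
    "id": [28845421, 29118744, 29118744, 29183010, 29320027],
    "year": [2007, 2018, 2016, 2019, 2013],
    "month": [3, 10, 7, 3, 3],
    "implement": ["2008-1", "2012-6", "2016-7", "2010-1", "2015-2"],
})

obs = pd.to_datetime(dict(year=df["year"], month=df["month"], day=1))
impl = pd.to_datetime(df["implement"], format="%Y-%m")

# binary-5 ... binary+2: has the observation reached implement shifted by n years?
for n in range(-5, 3):
    df[f"binary{n:+d}"] = (obs >= impl + pd.DateOffset(years=n)).astype(int)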
I've got some data structured like this:
select * from rules where time > now() - 1m limit 5
name: rules
time                 ackrate  consumers  deliverrate  hostname     publishrate  ready  redeliverrate  shard  unacked  version
----                 -------  ---------  -----------  --------     -----------  -----  -------------  -----  -------  -------
1513012628943000000  864      350        861.6        se-rabbit14  975.8        0      0              14     66       5
1513012628943000000  864.8    350        863          se-rabbit9   920.8        0      0              09     64       5
1513012628943000000  859.8    350        860.2        se-rabbit8   964.2        0      0              08     58       5
1513012628943000000  864.8    350        863.6        se-rabbit16  965.4        0      0              16     64       5
1513012631388000000  859.8    350        860.2        se-rabbit8   964.2        0      0              08     58       5
I want to calculate the percentage of 'up-time', defined as the fraction of time during which the queue has no ready messages.
I can get the maximum number of ready messages in each minute:
select max(ready) from rules where time > now() - 1h group by time(1m) limit 5
name: rules
time max
---- ---
1513009560000000000 0
1513009620000000000 0
1513009680000000000 0
1513009740000000000 0
1513009800000000000 0
Using a subquery, I can select only the minutes that have ready messages:
select ready from (select max(ready) as ready from rules where time > now() - 1h group by time(1m)) where ready > 0
name: rules
time ready
---- -----
1513010520000000000 49
1513013280000000000 57
I wanted to get a count of these values and then, with a bit of math, calculate a percentage. In this case, with 2 results in the last hour:
((60 minutes * 1 hour) - 2) / (60 minutes * 1 hour) ≈ 96.7%
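The arithmetic, spelled out (the one-hour window length is taken from the query above):

# Minutes in the window where ready > 0 count as "down".
down_minutes = 2
total_minutes = 60          # one hour of 1-minute buckets
uptime = (total_minutes - down_minutes) / total_minutes
print(f"{uptime:.1%}")      # 96.7%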
When I try to count this though, I get no response.
select count(ready) from (select max(ready) as ready from rules where time > now() - 1h group by time(1m)) where ready > 0
This is InfluxDB v1.2.2.
How can I return a count of the number of results?
The solution was simply to upgrade from v1.2.2 to v1.3.8. Using the later version:
select count(ready) from (select max(ready) as ready from rules where time > now() - 1h group by time(1m)) where ready > 0
name: rules
time count
---- -----
0 6
I have been trying to run a Cox PH model on a sample data set of 10k customers (randomly taken from a 32 million customer base) to predict the probability of survival at time t (which is a month in my case). I am using recurrent-event survival analysis with the counting process for e-commerce. For this:
1. Observation starting point: right after a customer makes their first purchase
2. Start/stop times: months of two consecutive purchases (as in the data)
I have a few independent variables as in the sample data below:
id  start  stop  status  tenure  orders  revenue  Quantity
A   0      20    0       0       1       $89.0    1
B   0      17    0       0       1       $556.0   2
C   0      17    0       0       1       $900.0   2
D   32     33    0       1679    9       $357.8   9
D   26     32    1       1497    7       $326.8   7
D   23     26    1       1405    4       $142.9   4
D   17     23    1       1219    3       $63.9    3
D   9      17    1       978     2       $50.0    2
D   0      9     1       694     1       $35.0    1
E   0      15    0       28      2       $156.0   2
F   0      15    0       0       1       $348.0   1
F   12     14    0       375     2       $216.8   3
F   0      12    1       0       1       $67.8    2
G   9      15    0       277     2       $419.0   2
G   0      9     1       0       1       $359.0   1
While running Cox PH using the following code:
library(survival)
fit10 <- coxph(Surv(start, stop, status) ~ orders + tenure + Quantity + revenue, data = test)
I keep getting the following warning:
Warning: X matrix deemed to be singular; variable 1 2 3 4
I tried searching for this warning online, but the answers I found said it could be caused by interacting independent variables, whereas my variables are individual and continuous.
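A quick collinearity check on the covariates may be more telling than interactions here; in the sample above, orders and Quantity move almost in lockstep. A sketch in Python/pandas (the model itself is in R, where cor() does the same job; the file name is hypothetical):

import pandas as pd

test = pd.read_csv("test.csv")  # hypothetical export of the data above

covariates = test[["orders", "tenure", "Quantity", "revenue"]].copy()
# revenue carries a "$" prefix in the sample; strip it before computing.
covariates["revenue"] = covariates["revenue"].astype(str).str.lstrip("$").astype(float)

# Off-diagonal correlations near 1 flag the singularity coxph() warns about.
print(covariates.corr())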
I have a 7 GB file in SPSS format. It contains survey data with comment-level scores and sentence-level scores. One comment can have multiple sentences, and one survey has up to 4 comments.
I am trying to do random sampling in SPSS so I can use the smaller file in R, but with simple random sampling I am not able to keep whole surveys and comments together.
What I want is to take a sample from this big file that picks only 5% of the survey IDs, so that all rows for a given survey stay together.
Surv_ID  Sentence_ID  Comment_ID  Sentence_Score  Comment_Score
A001     001          1            3.5             2
A001     002          1            2.8             2
A001     001          2            1.4            -1
A001     002          2           -2.9            -1
A001     003          2           -3.1            -1
A002     001          1            2.3             3
A002     002          1            4.3             3
A002     001          2            1.2             1
A002     002          2            0.85            1
A002     003          2            0.79            1
A002     001          3            3.5             2
A002     002          3           -3.1             2
A002     003          3            2.8             2
A003     001          1            1               1
A003     001          2           -0.9            -3
A003     002          2           -4.3            -3
A003     003          2           -4.0            -3
A003     001          3            3.4             3
A003     002          3            4.4             3
A003     001          4            2.8             2
COMPUTE RandNum=RV.UNIFORM(0,1).
* One random value per survey: overwrite with the maximum within each Surv_ID.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES OVERWRITE=YES /BREAK=Surv_ID /RandNum=MAX(RandNum).
SORT CASES BY RandNum Surv_ID.
* Counter that increases by 1 at every new Surv_ID.
COMPUTE SurvIDNum=SUM(LAG(SurvIDNum),(LAG(Surv_ID)<>Surv_ID)=1 OR $CASENUM=1).
* Total number of distinct surveys (MAX of the counter), so the cut-off is on survey IDs, not cases.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /TotNSurv=MAX(SurvIDNum).
COMPUTE SurvIDNumPCT=SurvIDNum/TotNSurv.
SELECT IF (SurvIDNumPCT<=0.05).
Create a random variable for all cases
Assign the maximum random value within each unique Surv_ID
Sort cases by the random variable, clustered by Surv_ID
Create a numeric counter for sequential Surv_IDs
Divide this counter by the total number of surveys to get a percentage
Select as many surveys as required
For the steps above, here is where to find the corresponding GUI equivalents:
Transform -> Compute Variable
Data -> Aggregate
Data -> Sort Cases
Transform -> Compute Variable
Transform -> Compute Variable
Data -> Select Cases
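For comparison, once the data is exported, the same whole-survey sampling is a few lines of Python/pandas; a minimal sketch, with the file name and random seed as the only assumptions beyond the question's column names:

import pandas as pd

df = pd.read_spss("survey.sav")  # hypothetical file name; requires pyreadstat

# Sample 5% of the unique survey IDs, then keep every row of those surveys.
sampled_ids = pd.Series(df["Surv_ID"].unique()).sample(frac=0.05, random_state=42)
subset = df[df["Surv_ID"].isin(sampled_ids)]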