Find out if a firm enters or leaves industries over time - time series

I have a data set with information about firms:
clear
input firm_id str6 industry int fyear
1084 7372 2010
1084 7375 2010
1084 7372 2011
1084 7375 2011
1084 7372 2012
1084 7375 2012
1084 7372 2013
1084 7375 2013
1084 7372 2014
1084 7375 2014
1094 2865 2002
1094 2879 2002
1094 5122 2002
1094 5169 2002
1094 2865 2003
1094 2879 2003
1094 5122 2003
1094 5169 2003
1094 2865 2004
1094 2879 2004
1094 5122 2004
1094 5169 2004
1094 2865 2005
1094 2879 2005
1094 5122 2005
1094 5169 2005
1094 2865 2006
1094 2879 2006
1094 5122 2006
1094 5169 2006
1094 2865 2007
1094 2879 2007
1094 5169 2007
1094 2865 2008
1094 2879 2008
end
Aside from firm_id, it records which industries the firm is active in for a given year.
How can I find out how many industries a firm entered and left in a given year?
I know that I could do this by writing a nested loop that looks at every individual observation and checks whether the same firm_id and industry combination exists for year+1. But my data set is large, so that would be extremely inefficient.
I also contemplated solutions using reshape wide, but could not find a solution to my problem that way either (and of course this creates an extremely large number of variables and is not efficient either).

If you are trying to generate a single observation for each firm with the number of industries it enters and exits in each year, I believe the following code should work. The variables enter and leave indicate (respectively) whether a firm enters or exits the industry in a given observation. Using a foreach loop over the years in the data, you can then generate the variables indicating whether the firm enters or exits each year.
* previous and next year on record within each firm-industry pair
bys firm_id industry (fyear): gen prevyear = fyear[_n-1]
gen yrdifpast = fyear - prevyear
* missing (no earlier year) compares as greater than 1, so the first year on record also counts as an entry
gen enter = yrdifpast > 1
bys firm_id industry (fyear): gen nextyear = fyear[_n+1]
gen yrdiffuture = nextyear - fyear
* likewise, the last year on record counts as an exit
gen leave = yrdiffuture > 1
* one entry/exit indicator per calendar year, summed by firm below
levelsof fyear, local(years)
foreach yr of local years {
    gen in_`yr' = fyear == `yr' & enter == 1
    gen out_`yr' = fyear == `yr' & leave == 1
}
collapse (sum) in_* out_*, by(firm_id)
list
+----------------------------------------------------------------------------------------------+
1. | firm_id | in_2002 | in_2003 | in_2004 | in_2005 | in_2006 | in_2007 | in_2008 | in_2010 |
| 1084 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
|---------+---------+---------+---------+----------+----------+----------+----------+----------|
| in_2011 | in_2012 | in_2013 | in_2014 | out_2002 | out_2003 | out_2004 | out_2005 | out_2006 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|----------------------------------------------------------------------------------------------|
| out_2007 | out_2008 | out_2010 | out_2011 | out_2012 | out_2013 | out_2014 |
| 0 | 0 | 0 | 0 | 0 | 0 | 2 |
+----------------------------------------------------------------------------------------------+
+----------------------------------------------------------------------------------------------+
2. | firm_id | in_2002 | in_2003 | in_2004 | in_2005 | in_2006 | in_2007 | in_2008 | in_2010 |
| 1094 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|---------+---------+---------+---------+----------+----------+----------+----------+----------|
| in_2011 | in_2012 | in_2013 | in_2014 | out_2002 | out_2003 | out_2004 | out_2005 | out_2006 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
|----------------------------------------------------------------------------------------------|
| out_2007 | out_2008 | out_2010 | out_2011 | out_2012 | out_2013 | out_2014 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 |
+----------------------------------------------------------------------------------------------+

I can't see that any loops are needed here. But what are needed are precise and explicit rules. Here a firm enters an industry whenever its first year on record is after the first year in the dataset and a firm leaves whenever its last year on record is before the last year in the dataset. Further, a firm leaves the industry if the next year on record is more than a year later, and a firm enters the industry if the previous year on record is more than a year previously. That allows leaving and re-entering, unlikely though such changes may be.
summarize fyear, meanonly
local first = r(min)
local last = r(max)
bysort firm_id industry (fyear) : generate enter = (fyear > `first') if _n == 1
by firm_id industry : replace enter = (fyear - fyear[_n-1]) > 1 if _n > 1
by firm_id industry : generate leave = fyear < `last' if _n == _N
by firm_id industry : replace leave = (fyear[_n+1] - fyear) > 1 if _n < _N
table fyear firm_id, c(sum enter sum leave)
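For anyone who needs the same rule outside Stata, here is a minimal Python/pandas sketch of the second approach. The DataFrame and column names (df, firm_id, industry, fyear) simply mirror the example data and are my own assumptions, not part of either answer.
import pandas as pd

def entry_exit_counts(df):
    # Sort within firm-industry so shifts compare adjacent years on record
    df = df.sort_values(["firm_id", "industry", "fyear"]).copy()
    grp = df.groupby(["firm_id", "industry"])["fyear"]
    first, last = df["fyear"].min(), df["fyear"].max()
    prev, nxt = grp.shift(1), grp.shift(-1)
    # Enter: first year on record after the panel start, or a gap of more than one year
    df["enter"] = ((prev.isna() & (df["fyear"] > first)) | (df["fyear"] - prev > 1)).astype(int)
    # Leave: last year on record before the panel end, or a gap of more than one year
    df["leave"] = ((nxt.isna() & (df["fyear"] < last)) | (nxt - df["fyear"] > 1)).astype(int)
    return df.groupby(["firm_id", "fyear"])[["enter", "leave"]].sum().reset_index()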

Related

Looking to count zero values from right to left until a non-zero value appears

I have a large table of monthly values.
I am looking to count the zero values from right to left, stopping once a non-zero value occurs.
I want the last column to display these values.
| JAN | FEB | MAR | APR | MAY | JUN | Value I need |
Ben | 10 | 10 | 10 | 0 | 0 | 0 | =3 |
Tim | 0 | 0 | 10 | 10 | 10 | 0 | =1 |
Susan | 0 | 0 | 5 | 10 | 0 | 10 | =0 |
Frank | 10 | 0 | 0 | 10 | 10 | 10 | =0 |
Many thanks for any help!
I don't think you need anything very sophisticated - just find the last column which is non-zero:
=ArrayFormula(columns(B:G)-max(if(B2:G2>0,column(B:G)-column(A:A),0)))
try:
=ARRAYFORMULA(IF(A2:A="",,LEN(REGEXREPLACE(INDEX(SPLIT(TRANSPOSE(QUERY(TRANSPOSE(
IF(VLOOKUP(A2:A, A2:G, TRANSPOSE(SORT(TRANSPOSE(COLUMN(B:G)), 1, 0)), 0)=0,
"♦", "♥")),,9^9)), "♥", , 0),,1), "^ .+| |#.+", ))))

Data Warehouse dimension for schedules (Dimensional Modeling)

I have not found an example or a way of building a dimension that contains schedule attributes. For example, in my scenario I'm building a data warehouse that will help to gather analytics on podcast/radio show episodes.
We have the following:
dim_episode
dim_podcast_show
dim_date
fact_user_daily_activity
And I'm trying to add another dimension that contains schedule attributes about the podcast_show, for example, some shows air their episodes every day, others on Tuesdays and Thursdays, others only on Saturdays.
dim_show_schedule (Option 1)
| schedule_key | show_key | time | sunday_flag | monday_flag | tuesday_flag | wednesday_flag | thursday_flag | friday_flag | saturday_flag |
|--------------|----------|-------|-------------|-------------|--------------|----------------|---------------|-------------|---------------|
| 1 | 0 | 00:30 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 12:30 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 3 | 2 | 21:00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
However, would it be better to have a bridge table with something like:
bridge_show_schedule (Option 2)
| show_key | day_key |
|----------|---------|
| 0 | 2 |
| 0 | 4 |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
dim_show_schedule (Option 3) (suggested by @nsousa)
| schedule_key | show_key | time | day |
|--------------|----------|-------|-------------|
| 1 | 0 | 00:30 | tuesday |
| 1 | 0 | 00:30 | thursday |
| 2 | 1 | 12:30 | monday |
| 2 | 1 | 12:30 | tuesday |
| 2 | 1 | 12:30 | wednesday |
| 2 | 1 | 12:30 | thursday |
| 2 | 1 | 12:30 | friday |
| 3 | 2 | 21:00 | saturday |
I've searched Kimball's The Data Warehouse Lifecycle Toolkit and could not find an example of this use case.
Any thoughts?
If you keep a dimension with a string attribute saying which days it's on, e.g. "M,W,F", the most entries you can have is 2^7 = 128. A bridge table is an unnecessary complication.
Option 1
You can create a schedule dimension that has a unique record for every possible schedule (128 daily combinations) combined with every reasonable start time. Using 5-minute intervals would still be fewer than 37k rows, which is trivial for a dimension.
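To make the row count concrete: 2^7 = 128 day combinations times 24 * 60 / 5 = 288 start times gives 36,864 rows. A short Python sketch that enumerates such a dimension (the column names here are my own, not from the post):
from itertools import product

days = ["sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday"]
start_times = [f"{h:02d}:{m:02d}" for h in range(24) for m in range(0, 60, 5)]  # 288 times

# One row per (start time, day-flag combination): 288 * 128 = 36,864 rows
rows = [
    dict(time=t, **{f"{d}_flag": flag for d, flag in zip(days, flags)})
    for t in start_times
    for flags in product([0, 1], repeat=7)
]
print(len(rows))  # 36864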
Option 2
If you want to leverage a date dimension instead, create a "Scheduled" fact that relates the show dimension to the date dimension for that future date. This would be handled in your ETL process to map the relationship. Your date dimension should already have the week and day-of-week logic included. You could also leverage your show duration attribute to create a semi-additive calculated measure that lets you easily get the total programming for the period.
I would opt for Option 2 as it provides many more possibilities for analytics.

Calculate a bunch of data to display on stacked bar

I'm struggling with creating my first chart.
I have a dataset of ordinal-scaled data from a survey.
It contains several questions with possible answers from 1 to 5.
So I have around 110 answers from different persons which I want to aggregate and show in a stacked bar.
Those data looks like:
| taste | region | brand | price |
| 1 | 3 | 4 | 2 |
| 1 | 1 | 5 | 1 |
| 1 | 3 | 4 | 3 |
| 2 | 2 | 5 | 1 |
| 1 | 1 | 4 | 5 |
| 5 | 3 | 5 | 2 |
| 1 | 5 | 5 | 2 |
| 2 | 4 | 1 | 3 |
| 1 | 3 | 5 | 4 |
| 1 | 4 | 4 | 5 |
...
To display that in a stacked bar chart, I need to count the answers.
So I know that at the end it needs to be calculated like:
| | taste | region | brand | price |
| 1 | 60 | 20 | 32 | 12 |
| 2 | 23 | 32 | 54 | 22 |
| 3 | 24 | 66 | 36 | 65 |
| 4 | 55 | 68 | 28 | 54 |
| 5 | 10 | 10 | 12 | 22 |
(this is just to demonstrate; the values are not correct)
Or maybe there is already a function for this in SPSS, but I have no idea where and how.
Any advice on how to do that?
I can't think of a single command but there are many ways to get to where you want. Here's one:
First, recreating your sample data:
data list list/ taste region brand price .
begin data
1 3 4 2
1 1 5 1
1 3 4 3
2 2 5 1
1 1 4 5
5 3 5 2
1 5 5 2
2 4 1 3
1 3 5 4
1 4 4 5
end data.
Now counting the values for each row:
vector t(5) r(5) b(5) p(5).
* the vector command is only necessary so the new variables will be ordered comfortably for the following parts.
do repeat vl= 1 to 5/t=t1 to t5/r=r1 to r5/b=b1 to b5/p=p1 to p5.
compute t=(taste=vl).
compute r=(region=vl).
compute b=(brand=vl).
compute p=(price=vl).
end repeat.
Now we can aggregate and restructure to arrive at the exact data structure you specified:
aggregate /outfile=* /break= /t1 to t5 r1 to r5 b1 to b5 p1 to p5 = sum(t1 to p5).
varstocases /make taste from t1 to t5 /make region from r1 to r5
/make brand from b1 to b5/ make price from p1 to p5/index=val(taste).
compute val = char.substr(val,2,1).
alter type val(f1).
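For comparison, the same counting step in Python/pandas (assuming the survey answers sit in a DataFrame with the four columns from the question; the values below are the sample data recreated above):
import pandas as pd

df = pd.DataFrame({
    "taste":  [1, 1, 1, 2, 1, 5, 1, 2, 1, 1],
    "region": [3, 1, 3, 2, 1, 3, 5, 4, 3, 4],
    "brand":  [4, 5, 4, 5, 4, 5, 5, 1, 5, 4],
    "price":  [2, 1, 3, 1, 5, 2, 2, 3, 4, 5],
})

# One row per answer value (1-5), one column per question, cells = frequency
counts = df.apply(lambda col: col.value_counts()).reindex(range(1, 6)).fillna(0).astype(int)
print(counts)
# counts.plot(kind="bar", stacked=True) would then draw the stacked bar directly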

weka gives 100% correctly classified instances for every dataset

I'm not able to get a meaningful accuracy estimate, as every dataset I provide gives 100% accuracy for every classifier algorithm I apply. My data set covers 10 people.
It gives the same accuracy for the Naive Bayes, J48, and JRip classifiers.
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| id | name | q1 | q2 | q3 | m1 | m2 | tut | fl | proj | fexam | total | grade |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| 1 | abv | 5 | 5 | 5 | 13 | 13 | 4 | 8 | 7 | 40 | 100 | p |
| 2 | ca | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 40 | 48 | f |
| 3 | ga | 4 | 2 | 3 | 5 | 10 | 4 | 5 | 6 | 20 | 59 | f |
| 4 | ui | 5 | 4 | 4 | 12 | 13 | 3 | 7 | 7 | 39 | 94 | p |
| 5 | pa | 4 | 1 | 1 | 4 | 3 | 2 | 4 | 5 | 22 | 46 | f |
| 6 | la | 2 | 3 | 1 | 1 | 2 | 0 | 4 | 2 | 11 | 26 | f |
| 7 | ka | 5 | 4 | 1 | 3 | 3 | 1 | 6 | 4 | 24 | 51 | f |
| 8 | ma | 5 | 3 | 3 | 9 | 8 | 4 | 8 | 0 | 20 | 60 | p |
| 9 | ash | 2 | 5 | 5 | 11 | 12 | 3 | 7 | 6 | 30 | 81 | p |
| 10 | opo | 4 | 2 | 1 | 13 | 1 | 3 | 7 | 3 | 35 | 69 | p |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
Make sure not to include any unique identifier column.
Also don't include the total.
Most likely, the classifiers learned that "name" is a good predictor and/or that you need more than 59 total points to pass.
I suggest you even withhold at least one exercise because of that - some classifiers will otherwise still learn that the sum of the individual points determines passing.
I assume you want to find out if one part is most indicative of passing, i.e. "if you do well on part 3, you will likely pass". But to answer this question, you need to account for, e.g., the different number of points per question - otherwise, your predictor will just identify which question carries the most points...
Also, 10 is much too small a sample size!
You can see from the displayed output that the tree J48 generated used only the variable fl, so I do not think you have the problem that @Anony-Mousse referred to.
I notice that you are testing on the training set (see the "Test Options" radio buttons at upper left of the GUI). That almost always overestimates the accuracy. What you are seeing is overfitting. Instead, use cross-validation to get a better estimate of the accuracy you could expect on new data. With only 10 data points, you should use either 10 folds or 5.
Try testing your model with cross-validation ("k folds") or a percentage split.
Generally with a percentage split, the training set is 2/3 of the dataset and the test set is 1/3.
Also, I feel that your dataset is very small... there is a high chance of inflated accuracy in that case.
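The gap between training-set accuracy and cross-validated accuracy is easy to reproduce outside Weka as well; here is a rough scikit-learn sketch on a made-up 10-row dataset (only the mechanism matters, none of these values come from the question):
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10 samples, as in the question: easy to memorize, hard to generalize from
X, y = make_classification(n_samples=10, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("accuracy on the training set:", clf.score(X, y))                     # typically 1.0
print("5-fold cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())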

Normalization, changing it to 1nf, 2nf and 3nf

INVOICE
So I have to put this into 1NF, 2NF and 3NF.
PROD_NUM PROD_LABEL PROD_PRICE
AA-E3422QW ROTARY SANDER 49.95
AA-E3422QW ROTARY SANDER 49.95
QD-300932X 0.25IN. DRILL BIT 3.45
RU-95748G BAND SAW 33.99
GH-778345P POWER DRILL 87.75
VEN_CODE VEN_NAME
211 NEVERFAIL, INC
211 NEVERFAIL, INC
211 NEVERFAIL, INC
309 BEGOOD, INC
157 TOUGHGO, INC
So far I have these as my 2NF. Am I going in the right direction? And how do I put the table into 3NF?
So my 2NF would look like this? [2NF table image]
I think the picture you were given is considered 1NF.
And you initially showed 3NF, but you'll need an additional table to reference which Product is by what Vendor as well as modify the invoice table.
Vendor - Unique list of vendors
VEN_ID | VEN_CODE | VEN_NAME
-------|----------|---------------
1 | 211 | NEVERFAIL, INC
2 | 309 | BEGOOD, INC
3 | 157 | TOUGHGO, INC
Product - Unique list of products
PROD_ID | PROD_NUM | PROD_LABEL | PROD_PRICE
--------|------------|-------------------|-----------
1 | AA-E3422QW | ROTARY SANDER | 49.95
2 | QD-300932X | 0.25IN. DRILL BIT | 3.45
3 | RU-95748G | BAND SAW | 33.99
4 | GH-778345P | POWER DRILL | 87.75
Vendor_Product - the mapping between products and vendors
VEN_ID | PROD_ID
-------|----------
1 | 1
1 | 2
2 | 3
3 | 4
Purchases - The transactions that happened
PURCH_ID | INV_NUM | SALE_DATE | PROD_ID | QUANT_SOLD
---------|---------|-------------|---------|------------
1 | 211347 | 15-JAN-2006 | 1 | 1
2 | 211347 | 15-JAN-2006 | 2 | 8
3 | 211347 | 15-JAN-2006 | 3 | 1
4 | 211348 | 15-JAN-2006 | 1 | 2
5 | 211349 | 16-JAN-2006 | 4 | 1
I think that is good, but it can be split again.
Invoices - A unique list of invoices
INV_ID | INV_NUM | SALE_DATE
--------|---------|-------------
1 | 211347 | 15-JAN-2006
2 | 211348 | 15-JAN-2006
3 | 211349 | 16-JAN-2006
Purchases - The transactions that happened
PURCH_ID | INV_ID | PROD_ID | QUANT_SOLD
---------|--------|---------|---------
1 | 1 | 1 | 1
2 | 1 | 2 | 8
3 | 1 | 3 | 1
4 | 2 | 1 | 2
5 | 3 | 4 | 1
To get 2NF, combine the vendor information back into the Product table, with these columns:
PROD_ID | PROD_NUM | PROD_LABEL | PROD_PRICE | VEN_CODE | VEN_NAME
In this case, the Vendor and Vendor_Product tables aren't needed.
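One quick way to sanity-check a decomposition like this is to confirm that joining the normalized tables reproduces the original flat invoice lines. A small pandas sketch using the tables from the answer above (values copied from those tables):
import pandas as pd

invoices = pd.DataFrame({"INV_ID": [1, 2, 3],
                         "INV_NUM": [211347, 211348, 211349],
                         "SALE_DATE": ["15-JAN-2006", "15-JAN-2006", "16-JAN-2006"]})
products = pd.DataFrame({"PROD_ID": [1, 2, 3, 4],
                         "PROD_NUM": ["AA-E3422QW", "QD-300932X", "RU-95748G", "GH-778345P"],
                         "PROD_PRICE": [49.95, 3.45, 33.99, 87.75]})
purchases = pd.DataFrame({"PURCH_ID": [1, 2, 3, 4, 5],
                          "INV_ID": [1, 1, 1, 2, 3],
                          "PROD_ID": [1, 2, 3, 1, 4],
                          "QUANT_SOLD": [1, 8, 1, 2, 1]})

# Joining the normalized tables rebuilds the flat invoice rows
flat = purchases.merge(invoices, on="INV_ID").merge(products, on="PROD_ID")
print(flat[["INV_NUM", "SALE_DATE", "PROD_NUM", "PROD_PRICE", "QUANT_SOLD"]])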
