Lagging and leading variables with non-panel data in Stata - time-series

For the purpose of an event study, I would like to create lagging and leading variables in my data set. Unfortunately, my data is not in a balanced panel format.
My data set looks like this:
clear
input id year month binary str7 implement
28845421 2007 3 0 2008-1
29118744 2018 10 1 2012-6
29118744 2016 7 1 2016-7
29183010 2019 3 1 2010-1
29320027 2013 3 0 2015-2
end
. list
+---------------------------------------------+
| id year month binary implem~t |
|---------------------------------------------|
1. | 2.88e+07 2007 3 0 2008-1 |
2. | 2.91e+07 2018 10 1 2012-6 |
3. | 2.91e+07 2016 7 1 2016-7 |
4. | 2.92e+07 2019 3 1 2010-1 |
5. | 2.93e+07 2013 3 0 2015-2 |
+---------------------------------------------+
The variable binary equals 1 once an observation's year and month combination has reached the implement date. Each observation is identified by id.
The goal is to create lagging and leading versions of binary: binary-5, binary-4, ..., binary+1, binary+2, .... In other words, I want to shift implement by n-year increments/decrements and recompute binary against each shifted date.
How can I create such variables in Stata?
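For what it's worth, here is a minimal sketch of the shifting logic in Python/pandas rather than Stata. The variable names come from the example above; treating binary+n as "the observation's date has reached implement shifted by n years" is my reading of the question, and the -5..+2 window is only an illustration.
import pandas as pd

df = pd.DataFrame({
    "id":        [28845421, 29118744, 29118744, 29183010, 29320027],
    "year":      [2007, 2018, 2016, 2019, 2013],
    "month":     [3, 10, 7, 3, 3],
    "implement": ["2008-1", "2012-6", "2016-7", "2010-1", "2015-2"],
})

# months elapsed since the implement date (negative = before implementation)
imp = df["implement"].str.split("-", expand=True).astype(int)
months_since = (df["year"] - imp[0]) * 12 + (df["month"] - imp[1])

# binary-n / binary+n: has the observation reached implement shifted by n years?
for n in range(-5, 3):                      # binary-5 ... binary+2
    df[f"binary{n:+d}"] = (months_since >= 12 * n).astype(int)
The same idea should carry over to Stata: compute the elapsed time between the observation date and implement once, then loop over n to generate each shifted indicator.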

Related

Calculate time differences and sum duration

I am trying to create a small "app" using Tasker on my Android phone that is supposed to track my work hours and over/under-time. I have managed to get Tasker to send timestamps at the start/end of each workday and am writing them to a Google Sheet, so it gets recorded like:
| A | B | C | D | E | F |
| 2020-01-29 | 07:24 | 16:33 | 00:09 | | -02:51 |
| 2020-01-30 | 07:00 | 12:00 | -03:00 | | |
Column "D" is the difference between ordinary work hours (8) and the hours actually registered.
Column "F" should summarize column "D" and show the sum of all values.
The data in the first three columns is being sent correctly, but I can't figure out how to set up formulas so that the values for column "D" are calculated, and the same for the cell in column "F". I have tried changing to different formats and creating my own formats too, but I don't understand how to get it to work.
I'm getting a different result than you in D1. I wonder if you're also accounting for a lunch hour (so subtract 9 instead of 8), but these formulas worked for me:
in Column D: =(C1-B1)-(8/24)
in Cell F1: =sum(D1:D2)
Column D and Cell F1 are formatted as Time > Duration.
Here's the result:
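The 8/24 term works because Sheets stores times and durations as fractions of a day. As a sanity check of the arithmetic, here is the same calculation in plain Python (not a Sheets formula), using the times from the question:
def to_days(hhmm):                        # "07:24" -> fraction of a day
    h, m = map(int, hhmm.split(":"))
    return (h + m / 60) / 24

d1 = (to_days("16:33") - to_days("07:24")) - 8 / 24   # column D, first row
d2 = (to_days("12:00") - to_days("07:00")) - 8 / 24   # column D, second row
f1 = d1 + d2                                          # cell F1 = SUM(D1:D2)
print(round(d1 * 24, 2), round(d2 * 24, 2), round(f1 * 24, 2))
# 1.15 -3.0 -1.85 hours, i.e. durations of +01:09, -03:00 and -01:51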

Speed up filter and sum results based on multiple criteria?

I am filtering, then summing, transaction data based on a date range and whether a column contains one of multiple possible values.
Example data:
A | B | C | D
-----------|-----|---------------------------------------------------|-------
11/12/2017 | POS | 6443 09DEC17 C , ALDI 84 773 , OFFERTON GB | -3.87
18/12/2017 | POS | 6443 16DEC17 C , CO-OP GROUP 108144, STOCKPORT GB | -6.24
02/01/2018 | POS | 6443 01JAN18 , AXA INSURANCE , 0330 024 1229 GB | -220.10
I currently have the following formula, which works but is really quite slow.
=sum(
iferror(
filter(
Transactions!$D:$D,
Transactions!$A:$A>=date(A2,B2,1),
Transactions!$A:$A<=date(A2,B2,31),
regexmatch(Transactions!$C:$C, "ALDI|LIDL|CO-OP GROUP 108144|SPAR|SAINSBURYS S|SAINSBURY'S S|TESCO STORES|MORRISON|MARKS AND SPENCER , HAZEL GROVE|HAZELDINES|ASDA")
)
,0
)
) * -1
The formula is on a separate sheet that is just a simple view of the results breakdown for each month of a year:
| A | B | C
--|------|----|----------
1 | 2017 | 12 | <formula> # December 2017
2 | 2017 | 11 | <formula> # November 2017
3 | 2017 | 10 | <formula> # October 2017
Is there a way to achieve this that would be more performant?
I tried using ArrayFormula and SUMIF, which works for the string criteria, but when I add the date criteria with SUMIFS it stops working.
I couldn't figure out a way to utilize INDEX and/or MATCH.
=query(
  filter(
    {Transactions!$A:$A, Transactions!$D:$D},
    regexmatch(Transactions!$C:$C, "ALDI|LIDL|CO-OP GROUP 108144|SPAR|SAINSBURYS S|SAINSBURY'S S|TESCO STORES|MORRISON|MARKS AND SPENCER , HAZEL GROVE|HAZELDINES|ASDA")
  ),
  "select year(Col1), month(Col1)+1, -1*sum(Col2) group by year(Col1), month(Col1)+1",
  0)
The result is a table like this:
year()   month()+1   sum
2017     11          3.87
2017     12          6.24
Add labels if needed. Sample query text with labels:
"select year(Col1), month(Col1)+1, -1*sum(Col2) group by year(Col1), month(Col1)+1 label year(Col1) 'Year', month(Col1)+1 'Month'"
The result:
Year Month sum
2017 11 3.87
2017 12 6.24
Explanations:
- The single-formula report reduces the number of filter calls, so it should work faster.
- It uses query syntax; more info here.
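To see why the single query is faster, here is the same shape of computation sketched in Python/pandas (column names are made up, the dates are read as day/month, and only the idea matters: filter once with the regex, then group by year and month and sum):
import pandas as pd

tx = pd.DataFrame({
    "date":   pd.to_datetime(["2017-12-11", "2017-12-18", "2018-01-02"]),
    "desc":   ["6443 09DEC17 C , ALDI 84 773 , OFFERTON GB",
               "6443 16DEC17 C , CO-OP GROUP 108144, STOCKPORT GB",
               "6443 01JAN18 , AXA INSURANCE , 0330 024 1229 GB"],
    "amount": [-3.87, -6.24, -220.10],
})

pattern = "ALDI|LIDL|CO-OP GROUP 108144|SPAR|TESCO STORES|MORRISON|ASDA"
matched = tx[tx["desc"].str.contains(pattern, regex=True)]

# one pass over the data: group by year and month, sum, and flip the sign
report = (matched.groupby([matched["date"].dt.year.rename("year"),
                           matched["date"].dt.month.rename("month")])["amount"]
                 .sum() * -1)
print(report)   # 2017-12 -> 10.11 (ALDI + CO-OP); the AXA row is filtered out
The original sheet runs a separate FILTER and REGEXMATCH over the whole Transactions range for every month cell, while the query scans the data once and lets group by produce every month at the same time.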

Condition for memory access conflict in memory-banked vector processors

The Hennessy-Patterson book Computer Architecture: A Quantitative Approach (5th ed.) says that in a vector architecture with multiple memory banks, a bank conflict can happen if the following condition is met (page 279):
(Number of banks) / LeastCommonMultiple(Number of banks, Stride) < Bank busy time
However, I think it should be GreatestCommonFactor instead of LCM, because a memory conflict would occur if the effective number of banks you have is less than the busy time. By effective number of banks I mean this: let's say you have 8 banks and a stride of 2; then effectively you have 4 banks, because the memory accesses will be lined up at only four banks (e.g., if your accesses are all even numbers starting from 0, they will all land on banks 0, 2, 4 and 6).
In fact, this formula even fails for the example given right below it: "Suppose we have 8 memory banks with a busy time of 6 clock cycles and a total memory latency of 12 clock cycles. How long will it take to complete a 64-element vector load with a stride of 1?" Here they calculate the time as 12 + 64 = 76 clock cycles. However, according to the condition as given, a bank conflict would occur, so we clearly couldn't have one access per cycle (the 64 in the equation).
Am I getting it wrong, or has the wrong formula managed to survive 5 editions of this book (unlikely)?
GCD(banks, stride) should come into it; your argument about that is correct.
Let's try this for a few different strides and see what we get, for number of banks = b = 8.
# generated with calc(1), using this function:
define f(s) { print s, " | ", lcm(s,8), " | ", gcd(s,8), " | ", 8/lcm(s,8), " | ", 8/gcd(s,8) }
stride | LCM(s,b) | GCF(s,b) | b/LCM(s,b) | b/GCF(s,b)
     1 |        8 |        1 |          1 |          8      # 8 < 6 = false: no conflict
     2 |        8 |        2 |          1 |          4      # 4 < 6 = true: conflict
     3 |       24 |        1 |     ~0.333 |          8      # 8 < 6 = false: no conflict
     4 |        8 |        4 |          1 |          2      # 2 < 6 = true: conflict
     5 |       40 |        1 |        0.2 |          8
     6 |       24 |        2 |     ~0.333 |          4
     7 |       56 |        1 |     ~0.143 |          8
     8 |        8 |        8 |          1 |          1
     9 |       72 |        1 |     ~0.111 |          8
     x |      >=8 |   2^0..3 |        <=1 |  1, 2, 4, or 8
b/LCM(s,b) is always <=1, so it always predicts conflicts.
I think GCF (aka GCD) looks right for the stride values I've looked at so far. You only have a problem if the stride doesn't distribute the accesses over all the banks, and that's what b/GCF(s,b) tells you.
Stride = 8 should be the worst-case, using the same bank every time. gcd(8,8) = lcm(8,8) = 8. So both expressions give 8/8 = 1 which is less than the bank busy/recovery time, thus correctly predicting conflicts.
Stride = 1 is of course the best case (no conflicts if there are enough banks to hide the busy time). gcd(8, 1) = 1 correctly predicts no conflicts: 8/1 = 8, which is not less than 6. lcm(8, 1) = 8, and 8/8 < 6 is true, so it incorrectly predicts conflicts.
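A quick way to check both candidate conditions side by side, assuming 8 banks and a 6-cycle bank busy time as in the book's example (plain Python):
from math import gcd, lcm      # math.lcm needs Python 3.9+

banks, busy = 8, 6
for stride in range(1, 10):
    touched = banks // gcd(banks, stride)   # distinct banks the stride actually hits
    gcd_says = "conflict" if touched < busy else "no conflict"
    lcm_says = "conflict" if banks / lcm(banks, stride) < busy else "no conflict"
    print(f"stride {stride}: GCD rule -> {gcd_says}, book's LCM rule -> {lcm_says}")
Only the GCD rule leaves stride 1 conflict-free, which is what the book's own 12 + 64 = 76 cycle answer assumes.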

How can I sort my results based on the order that the user sets?

In my database I have many columns, which I will summarize as:
Code input
Amount 1
Amount 2
Code Phase
Code Sector
Code Group
Here is an example of the rows that I have:
+--------------+----------+----------+-------+--------+-------+
| code_input | amount_1 | amount_2 | phase | sector | group |
+--------------+----------+----------+-------+--------+-------+
| 0171090150 | 22 | 14 | 09 | 90 | 10 |
| 0258212527 | 12 | 99 | 08 | 30 | 20 |
| 0359700504 | 30 | 10 | 09 | 20 | 20 |
+--------------+----------+----------+-------+--------+-------+
The user has a place where they can decide what goes first, second, third and fourth. For example, they can decide that code_phase goes first, code_sector second, code_group third, and finally code_input. Or the user can play with that order (code_sector first, code_phase second, etc.).
In my database, the amounts are recorded at the input level. Therefore, if a sector includes 2 inputs, the total for this sector is the sum of those two inputs.
Example of result with one order:
# => Order: Phase, Sector, Group, Input
- Phase 09 52 24
- Sector 90 22 14
- Group 10 22 14
- Input 0171090150 22 14
- Sector 20 30 10
- Group 20 30 10
- Input 0359700504 30 10
- Phase 08 12 99
- Sector 30 12 99
- Group 20 12 99
- Input 0258212527 12 99
I use Ruby on Rails and I have 3,579 lines of code covering all the combinations that users can choose. But that code is unmaintainable and cumbersome, and sometimes I get confused myself about where a mistake might be.
So, I wonder if any of you can tell me whether there is a gem that could help; it doesn't have to do the whole ordering, but anything that helps me optimize my code would be welcome. Or perhaps you can recommend a method or algorithm that can produce this ordering.
Sorry for my English.
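Not a Rails answer, but the general shape of the computation is a recursive group-and-sum driven by the user's chosen column order. A minimal sketch in Python, using the rows from the example above (in Rails the same idea could be expressed with Enumerable#group_by or a SQL GROUP BY):
from itertools import groupby
from operator import itemgetter

rows = [
    {"code_input": "0171090150", "amount_1": 22, "amount_2": 14,
     "phase": "09", "sector": "90", "group": "10"},
    {"code_input": "0258212527", "amount_1": 12, "amount_2": 99,
     "phase": "08", "sector": "30", "group": "20"},
    {"code_input": "0359700504", "amount_1": 30, "amount_2": 10,
     "phase": "09", "sector": "20", "group": "20"},
]

def report(rows, order, depth=0):
    if not order:                           # leaf level: the individual inputs
        for r in rows:
            print("  " * depth, "- Input", r["code_input"], r["amount_1"], r["amount_2"])
        return
    key, rest = order[0], order[1:]
    for value, grp in groupby(sorted(rows, key=itemgetter(key)), key=itemgetter(key)):
        grp = list(grp)
        print("  " * depth, "-", key.capitalize(), value,
              sum(r["amount_1"] for r in grp), sum(r["amount_2"] for r in grp))
        report(grp, rest, depth + 1)

# the user picked: Phase, Sector, Group, then the individual inputs
report(rows, ["phase", "sector", "group"])
Because the column order is just a parameter, one short recursive function can replace the thousands of lines that enumerate every permutation by hand.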

Stata: Convert date, quarter to year

I have a time series dataset with quarterly observations, which I want to collapse to an annual series. For that, I need to transform my date variable first.
It looks like
. list date in 1/5
+--------+
| date |
|--------|
1. | 1991q1 |
2. | 1991q2 |
3. | 1991q3 |
4. | 1991q4 |
5. | 1992q1 |
+--------+
Hence, to collapse, I want date (or date2) to be 1991, 1991, 1991, 1991, 1992 etc.
Once I have that, I could use collapse or tscollapse to turn my dataset into annual data.
// create some example data
. clear all
. set obs 5
obs was 0, now 5
. gen date = 123 + _n
. format date %tq
// create the yearly date
. gen date2 = yofd(dofq(date))
// admire the result
. list
+----------------+
| date date2 |
|----------------|
1. | 1991q1 1991 |
2. | 1991q2 1991 |
3. | 1991q3 1991 |
4. | 1991q4 1991 |
5. | 1992q1 1992 |
+----------------+
Another way is just to remember that years and quarters are just integers. A little consultation of the documentation and a little fiddling around yield
. gen Y = 1960 + floor(Q/4)
as a conversion rule to get years from Stata quarterly dates. Formatting year as a yearly date is then permissible but superfluous.
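For anyone double-checking the arithmetic: Stata quarterly dates count quarters since 1960q1, which is quarter 0, so the conversion is plain integer division by 4. A quick check in Python:
def quarter_to_year(q):            # q is a Stata %tq value (1960q1 = 0)
    return 1960 + q // 4

print([quarter_to_year(q) for q in range(124, 129)])
# [1991, 1991, 1991, 1991, 1992]  -- matches 1991q1 ... 1992q1 above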
