Moving Average across Variables in Stata - foreach

I have a panel data set for which I would like to calculate moving averages across years.
Each year is a variable for which there is an observation for each state, and I would like to create a new variable for the average of every three year period.
For example:
P1947=rmean(v1943 v1944 v1945), P1948=rmean(v1944 v1945 v1946)
I figured I should use a foreach loop with the egen command, but I'm not sure about how I should refer to the different variables within the loop.
I'd appreciate any guidance!

This data structure is quite unfit for purpose. Assuming an identifier id you need to reshape, e.g.
reshape long v, i(id) j(year)
tsset id year
Then a moving average is easy. Use tssmooth or just generate, e.g.
gen mave = (L.v + v + F.v)/3
or (better)
gen mave = 0.25 * L.v + 0.5 * v + 0.25 * F.v
More on why your data structure is quite unfit: Not only would calculation of a moving average need a loop (not necessarily involving egen), but you would be creating several new extra variables. Using those in any subsequent analysis would be somewhere between awkward and impossible.
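For readers outside Stata, here is a rough pandas analogue of the same reshape-then-smooth idea (a minimal sketch, not part of the Stata answer; the column names id, v1943, v1944, ... are invented for illustration):
# Sketch only: a wide panel with invented columns id, v1943, v1944, ...
import pandas as pd

wide = pd.DataFrame({"id": [1, 2],
                     "v1943": [1.0, 2.0], "v1944": [2.0, 4.0],
                     "v1945": [3.0, 6.0], "v1946": [4.0, 8.0]})
# ~ reshape long v, i(id) j(year)
long = wide.melt(id_vars="id", var_name="year", value_name="v")
long["year"] = long["year"].str.lstrip("v").astype(int)
long = long.sort_values(["id", "year"])
# ~ gen mave = (L.v + v + F.v)/3, computed within each panel
long["mave"] = (long.groupby("id")["v"]
                    .transform(lambda s: s.rolling(3, center=True).mean()))
print(long)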
EDIT I'll give a sample loop, while not moving from my stance that it is poor technique. I don't see a reason behind your naming convention whereby P1947 is a mean for 1943-1945; I assume that's just a typo. Let's suppose that we have data for 1913-2012. For means of 3 years, we lose one year at each end.
forval j = 1914/2011 {
    local i = `j' - 1
    local k = `j' + 1
    gen P`j' = (v`i' + v`j' + v`k') / 3
}
That could be written more concisely, at the expense of a flurry of macros within macros. Using unequal weights is easy, as above. The only reason to use egen is that it doesn't give up if there are missings, which the loop above will do: it returns missing whenever any of the three years is missing.
FURTHER EDIT
As a matter of completeness, note that it is easy to handle missings without resorting to egen.
The numerator
(v`i' + v`j' + v`k')
generalises to
(cond(missing(v`i'), 0, v`i') + cond(missing(v`j'), 0, v`j') + cond(missing(v`k'), 0, v`k'))
and the denominator
3
generalises to
!missing(v`i') + !missing(v`j') + !missing(v`k')
If all values are missing, this reduces to 0/0, or missing. Otherwise, if any value is missing, we add 0 to the numerator and 0 to the denominator, which is the same as ignoring it. Naturally the code is tolerable as above for averages of 3 years, but either for that case or for averaging over more years, we would replace the lines above by a loop, which is what egen does.
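For concreteness, the same numerator/denominator trick sketched in Python (purely illustrative; None stands in for Stata's missing value):
def mave3(prev, cur, nxt):
    # keep only the non-missing values
    vals = [v for v in (prev, cur, nxt) if v is not None]
    # no non-missing values -> 0/0 in the Stata formula -> missing
    return sum(vals) / len(vals) if vals else None

print(mave3(1.0, None, 3.0))    # 2.0 -- the missing year is simply ignored
print(mave3(None, None, None))  # None -- all missing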

There is a user-written program that can do that very easily for you. It is called mvsumm and can be found by typing findit mvsumm
xtset id time
mvsumm observations, stat(mean) win(t) gen(new_variable) end

Related

Google sheets, summing over a formula?

Apologies if this is a little unclear, or if this question has been asked already. It's a little difficult to explain, but my question, at the end below, is essentially about shortening formulas.
I'm running a payment plan waterfall model. My code works, but, it's er, you know...
IF($K2=Q$1,Assumptions!$E$56*($N2+$O2),IF(AND($K2<Q$1,$M2>Q$1),Assumptions!$E$56*$P2/$L2,0)) + IF(edate($K2,1)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,1)<Q$1,edate($M2,1)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,2)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,2)<Q$1,edate($M2,2)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,3)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,3)<Q$1,edate($M2,3)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,4)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,4)<Q$1,edate($M2,4)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,5)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,5)<Q$1,edate($M2,5)>Q$1),Assumptions!$F$56*$P2/$L2,0)) + IF(edate($K2,6)=Q$1,Assumptions!$F$56*($N2+$O2),IF(AND(edate($K2,6)<Q$1,edate($M2,6)>Q$1),Assumptions!$F$56*$P2/$L2,0))
...pretty long.
Essentially what's going on is this: the assumption is that when we launch a product, we sell say 80% in the first month, and 2.5% every subsequent month until we reach 100%.
I'd like the 80% and the 2.5% to be variables (listed as Assumptions!$E$56 and Assumptions!$F$56 here).
Obviously a little long. But notice that after the first IF clause, the subsequent ones are actually identical, the only difference being the number inside edate(__,2), edate(__,3), and so on.
So my question is - can this code be tidied up into some sort of for loop? Python would make it pretty simple to increment over the variable edate(__,i) and sum over i = 1:6.
Sure, it can. In Google Sheets, looping is usually emulated with SEQUENCE(N), which produces a vertical array of the numbers 1 to N, somewhat like your Python range. Then you can do stuff to it as an ArrayFormula.
In your case, you end up with two terms: your initial term using $E, and all the looped stuff using $F. I see 6 repeated edate terms, so I will use SEQUENCE(6):
=IF(
  $K2=Q$1,
  Assumptions!$E$56*($N2+$O2),
  IF(
    AND(
      $K2<Q$1,
      $M2>Q$1
    ),
    Assumptions!$E$56*$P2/$L2,
    0
  )
) + ArrayFormula(SUM(
  IF(
    edate($K2,SEQUENCE(6))=Q$1,
    Assumptions!$F$56*($N2+$O2),
    IF(
      (edate($K2,SEQUENCE(6))<Q$1)*
      (edate($M2,SEQUENCE(6))>Q$1),
      Assumptions!$F$56*$P2/$L2,
      0
    )
  )
))
And if you want, you can give your Assumption values names using named ranges.
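As a cross-check of what the formula computes, here is a hedged Python sketch of the loop the question alludes to. The names (k, m, q, n, o, p, l, first_rate, later_rate) are invented stand-ins for the cell references and assumptions, and this minimal edate assumes dates that fall on a day present in every month (e.g. the 1st):
from datetime import date

def edate(d, months):
    # minimal EDATE: shift a date by whole months (no day-overflow handling)
    m = d.month - 1 + months
    return d.replace(year=d.year + m // 12, month=m % 12 + 1)

def payment(k, m, q, n, o, p, l, first_rate, later_rate):
    # k, m, q are datetime.date values; the rest are numbers
    def term(rate, shift):
        if edate(k, shift) == q:
            return rate * (n + o)
        if edate(k, shift) < q < edate(m, shift):
            return rate * p / l
        return 0
    # first month at first_rate (the 80%), then months 1..6 at later_rate (the 2.5%)
    return term(first_rate, 0) + sum(term(later_rate, i) for i in range(1, 7))

# example call with made-up values:
# payment(date(2024, 1, 1), date(2024, 12, 1), date(2024, 3, 1), 100, 50, 200, 10, 0.8, 0.025)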

Performing an "online" linear interpolation

I have a problem where I need to do a linear interpolation on some data as it is acquired from a sensor (it's technically position data, but the nature of the data doesn't really matter). I'm doing this now in MATLAB, but since I will eventually migrate this code to other languages, I want to keep the code as simple as possible and not use any complicated MATLAB-specific/built-in functions.
My implementation initially seems OK, but when checking my work against MATLAB's built-in interp1 function, it seems my implementation isn't perfect, and I have no idea why. Below is the code I'm using on a dataset already fully collected, but as I loop through the data, I act as if I only have the current sample and the previous sample, which mirrors the problem I will eventually face.
%make some dummy data
np = 109; %number of data points for x and y
x_data = linspace(3,98,np) + (normrnd(0.4,0.2,[1,np]));
y_data = normrnd(2.5, 1.5, [1,np]);
%define the query points the data will be interpolated over
qp = [1:100];
kk = 2; %indexes through the data
cc = 1; %indexes through the query points
qpi = qp(cc); %qpi is the current query point in the loop
y_interp = qp*nan; %this will hold our solution
while kk<=length(x_data)
    kk = kk+1; %update the data counter
    %perform online interpolation
    if cc<length(qp)-1
        if qpi>=y_data(kk-1) %the query point, of course, has to be in-between the current value and the next value of x_data
            y_interp(cc) = myInterp(x_data(kk-1), x_data(kk), y_data(kk-1), y_data(kk), qpi);
        end
        if qpi>x_data(kk) %if the current query point is already larger than the current sample, update the sample
            kk = kk+1;
        else %otherwise, update the query point to ensure it's in between the samples for the next iteration
            cc = cc + 1;
            qpi = qp(cc);
            %It is possible that if the change in x_data is greater than the resolution of the query
            %points, an update like the above won't work. In this case, we must lag the data
            if qpi<x_data(kk)
                kk = kk-1;
            end
        end
    end
end
%get the correct interpolation
y_interp_correct = interp1(x_data, y_data, qp);
%plot both solutions to show the difference
figure;
plot(y_interp,'displayname','manual-solution'); hold on;
plot(y_interp_correct,'k--','displayname','matlab solution');
leg1 = legend('show');
set(leg1,'Location','Best');
ylabel('interpolated points');
xlabel('query points');
Note that the "myInterp" function is as follows:
function yi = myInterp(x1, x2, y1, y2, qp)
    %linearly interpolate the function value y(x) over the query point qp
    yi = y1 + (qp-x1) * ( (y2-y1)/(x2-x1) );
end
And here is the plot showing that my implementation isn't correct :-(
Can anyone help me find where the mistake is? And why? I suspect it has something to do with ensuring that the query point is in-between the previous and current x-samples, but I'm not sure.
The problem in your code is that you at times call myInterp with a value of qpi that is outside of the bounds x_data(kk-1) and x_data(kk). This leads to invalid extrapolation results.
Your logic of looping over kk rather than cc is very confusing to me. I would write a simple for loop over cc, which are the points at which you want to interpolate. For each of these points, advance kk, if necessary, such that qp(cc) is in between x_data(kk) and x_data(kk+1) (you can use kk-1 and kk instead if you prefer, just initialize kk=2 to ensure that kk-1 exists, I just find starting at kk=1 more intuitive).
To simplify the logic here, I'm limiting the values in qp to be inside the limits of x_data, so that we don't need to test to ensure that x_data(kk+1) exists, nor that x_data(1)<qp(cc). You can add those tests in if you wish.
Here's my code:
qp = [ceil(x_data(1)+0.1):floor(x_data(end)-0.1)];
y_interp = qp*nan; % this will hold our solution
kk = 1; % indexes through the data
for cc = 1:numel(qp)
    % advance kk to where we can interpolate
    % (this loop is guaranteed to not index out of bounds because x_data(end)>qp(end),
    % but needs to be adjusted if this is not ensured prior to the loop)
    while x_data(kk+1) < qp(cc)
        kk = kk + 1;
    end
    % perform online interpolation
    y_interp(cc) = myInterp(x_data(kk), x_data(kk+1), y_data(kk), y_data(kk+1), qp(cc));
end
As you can see, the logic is a lot simpler this way. The result is identical to y_interp_correct. The inner while x_data... loop serves the same purpose as your outer while loop, and would be the place where you read your data from wherever it's coming from.
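Since the question mentions migrating to other languages eventually, here is a minimal Python sketch of the corrected loop (0-based indexing), under the same assumption that x_data is sorted and every query point lies strictly inside its range:
def my_interp(x1, x2, y1, y2, q):
    # linearly interpolate y(x) at the query point q
    return y1 + (q - x1) * (y2 - y1) / (x2 - x1)

def online_interp(x_data, y_data, qp):
    y_interp = []
    kk = 0  # index into the data
    for q in qp:
        # advance kk until q lies between x_data[kk] and x_data[kk+1]
        while x_data[kk + 1] < q:
            kk += 1
        y_interp.append(my_interp(x_data[kk], x_data[kk + 1],
                                  y_data[kk], y_data[kk + 1], q))
    return y_interp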

Display polynomials in reverse order in SageMath

So I would like to print polynomials in one variable (s) with one parameter (a), say
a·s^3 − s^2 − a^2·s − a + 1.
Sage always displays it with decreasing degree, and I would like to get something like
1 - a - a^2·s - s^2 + a·s^3
to export it to LaTeX. I can't figure out how to do this... Thanks in advance.
As an alternative to string manipulation, one can use the series expansion.
F = a*s^3 - s^2 - a^2*s - a + 1
F.series(s, F.degree(s)+1)
returns
(-a + 1) + (-a^2)*s + (-1)*s^2 + (a)*s^3
which appears to be what you wanted, save for some redundant parentheses.
This works because (a) a power series is ordered from lowest to highest powers, and (b) making the order of the remainder greater than the degree of the polynomial ensures that the series is just the polynomial itself.
This is not easy, because the sort order is defined in Pynac, a fork of GiNaC, which Sage uses for its basic symbolic manipulation. However, depending on what you need, it is possible programmatically:
sage: F = 1 + x + x^2
sage: "+".join(map(str,sorted([f for f in F.operands()],key=lambda exp:exp.degree(x))))
'1+x+x^2'
I don't know whether this sort of thing is powerful enough for your needs, though. You may have to traverse the "expression tree" quite a bit but at least your sort of example seems to work.
sage: F = a + a^2*x + x^2 - a*x^2
sage: "+".join(map(str,sorted([f for f in F.operands()],key=lambda exp:exp.degree(x))))
'a+a^2*x+-a*x^2+x^2'
Doing this in a short statement requires a number of Python tricks like this, which are very well worth learning if you are going to use Sage (or Numpy, or pandas, or ...) a fair amount.
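If you need this more than once, the one-liner can be wrapped in a small helper. A sketch in the same Sage session style, assuming (as above) that the expression is a sum of terms:
sage: a = var('a')
sage: def ascending(F, x):
....:     return " + ".join(str(f) for f in
....:                       sorted(F.operands(), key=lambda t: t.degree(x)))
sage: ascending(a + a^2*x + x^2 - a*x^2, x)
'a + a^2*x + -a*x^2 + x^2'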

Big-O of an operation over a single linked list

Suppose you've got a single linked list of size N, and you want to perform an operation on every element, beginning at the end.
I've come up with the following pseudocode:
while N > 0
    Current = LinkedList
    for 1 to N-1
        Current = Current.tail
    end
    Operation(Current.head)
    N := N-1
end
Now I've got to determine which Big-O this algorithm is.
Supposing that Operation() is O(1), I think it's something like this:
N + (N-1) + (N-2) + ... + 2 + 1
But I'm not sure what Big-O that actually is. I think it is definitely smaller than O(N^2), but I don't think you can say its O(N) either ...
Your equation is basically that of the triangular numbers, and sums to N(N+1)/2. I'll leave you to determine the O() from that!
A quicker way to do this is to construct a new list that is the reverse of the original list, and then perform the operations on that.
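A minimal Python sketch of that reverse-then-operate idea (the Node class is an invented stand-in for the question's linked list):
class Node:
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

def for_each_reversed(lst, operation):
    rev = None
    while lst is not None:   # O(N): push each element onto a new list
        rev = Node(lst.head, rev)
        lst = lst.tail
    while rev is not None:   # O(N): traverse the reversed copy once
        operation(rev.head)
        rev = rev.tail

for_each_reversed(Node(1, Node(2, Node(3))), print)  # prints 3, 2, 1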
Your algorithm is O(n^2) as you suggest in your post. You can do it in O(n), though.
It's important to remember that Big-O notation is an upper bound on the algorithm's time complexity.
1+2+3+...+n = n*(n+1)/2 = 0.5*n^2+O(n)
This is O(n^2), and O(n^2) is tight, i.e. there is no lower runtime order that'd contain your runtime.
A faster algorithm that works front-to-back could be O(n) instead of O(n^2).
Your runtime analysis is correct: the runtime is 1 + 2 + ... + N, which is the sum of an arithmetic progression and therefore equals (N² + N) / 2.

Can a SHA-1 hash be all-zeroes?

Is there any input that SHA-1 will compute to a hex value of forty zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely. I.e. one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
Also, because SHA-1 is cryptographically strong, it would be computationally infeasible (at least with current computer technology; all bets are off for emergent technologies such as quantum computing) to find input data that results in an all-zero hash before it occurred in practice. If you really must use the "0" hash as a sentinel, be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for; if real input ever does hash to your sentinel, your code will be broken.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
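The sentinel values mentioned here are easy to confirm, e.g. with Python's hashlib:
import hashlib
print(hashlib.sha1(b"").hexdigest())   # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(hashlib.sha1(b"a").hexdigest())  # 86f7e437faa5a7fce15d1ddcb9eaeaea377667b8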
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the constant k is never zero, so temp can't be zero simply because all operands on the right-hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0xEFCDAB89 & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
which gives us (a leftrotate 5) + f + e + k = 0x9fb498b3, and hence w[0] = -0x9fb498b3 mod 2^32 = 0x604b674d. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at least 32 other input values.
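The constants in this example can be checked with a few lines of Python (all arithmetic mod 2^32):
M = 2**32
a, e, k = 0x67452301, 0xC3D2E1F0, 0x5A827999
b, c, d = 0xEFCDAB89, 0x98BADCFE, 0x10325476
rotl5 = ((a << 5) | (a >> 27)) % M   # a leftrotate 5
f = (b & c) | (~b % M & d)           # round-0 choice function
s = (rotl5 + f + e + k) % M
print(hex(f), hex(s), hex(-s % M))   # 0x98badcfe 0x9fb498b3 0x604b674d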
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
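For a single-block message, where h0 through h4 still hold SHA-1's standard initial values, the round would therefore have to end with the registers at exactly the mod-2^32 negations of those constants. A small illustrative Python check (not from the original answer):
M = 2**32
h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
# (a, b, c, d, e) would have to end round 79 at these values
print([hex((-x) % M) for x in h])
# ['0x98badcff', '0x10325477', '0x67452302', '0xefcdab8a', '0x3c2d1e10']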
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People don't seem to realize just how improbable it is (if all humanity focused all of its current resources on finding a zero hash by brute force, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall only because the probability of doing so is insanely low.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.
