Matlab, Econometrics toolbox - Simulate ARIMA with deterministic time-varying variance - time-series

DISCLAIMER: This question is only for those who have access to the Econometrics Toolbox in MATLAB.
The situation: I would like to use MATLAB to simulate N observations from an ARIMA(p, d, q) model using the Econometrics Toolbox. What's the difficulty? I would like the innovations to be simulated with deterministic, time-varying variance.
Question 1) Can I do this using the built-in MATLAB simulate function without altering it myself? As near as I can tell, this is not possible. From my reading of the docs, the innovations can either be specified to have a constant variance (i.e., the same variance for each innovation), or be specified to be stochastically time-varying (e.g., a GARCH model), but they cannot be deterministically time-varying, where I, the user, choose their values (except in the trivial constant case).
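To make the goal concrete, here is a hand-rolled sketch of what I'm after, written for an ARIMA(1,1,1) with made-up coefficients (the coefficients and variance path are purely illustrative, and no toolbox is involved):
N = 500; %observations to simulate
ar = 0.5; ma = 0.3; %illustrative AR and MA coefficients
sigma2 = linspace(0.5, 2, N); %deterministic variance path chosen by the user
e = sqrt(sigma2) .* randn(1, N); %innovations with Var(e(t)) = sigma2(t)
w = zeros(1, N); %stationary ARMA(1,1) part
for t = 2:N
w(t) = ar*w(t-1) + e(t) + ma*e(t-1);
end
X = cumsum(w); %integrate once, giving an ARIMA(1,1,1) path
The question is whether simulate can be made to generate its innovations in this way.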
Question 2) If the answer to question 1 is "No", then does anyone see any reason why I can't edit the simulate function from the econometrics toolbox as follows:
a) Alter the preamble such that the function won't throw an error if the Variance field in the input model is set to a numeric vector instead of a numeric scalar.
b) Alter line 310 of simulate from:
E(:,(maxPQ + 1:end)) = Z * sqrt(variance);
to
E(:,(maxPQ + 1:end)) = (ones(NumPath, 1) * sqrt(variance)) .* Z;
where NumPath is the number of paths to be simulated, and it can be assumed that I've included an error trap to ensure that the (input) deterministic variance path stored in variance is of the right length (i.e., equal to the number of observations to be simulated per path).
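As a quick illustration of why the proposed line does what I want (toy dimensions, purely for demonstration):
NumPath = 3; NumObs = 5; %toy sizes
variance = linspace(0.5, 2.5, NumObs); %deterministic variance path (1-by-NumObs row)
Z = randn(NumPath, NumObs); %iid standard normal draws
E = (ones(NumPath, 1) * sqrt(variance)) .* Z; %column t is scaled by sqrt(variance(t))
size(E) %NumPath-by-NumObs, as required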
Any help would be most appreciated. Apologies if the question seems basic, I just haven't ever edited one of Mathwork's own functions before and didn't want to do something foolish.
UPDATE (2012-10-18): I'm confident that the code edit I've suggested above is valid, and I'm mostly confident that it won't break anything else. However it turns out that implementing the solution is not trivial due to file permissions. I'm currently talking with Mathworks about the best way to achieve my goal. I'll post the results here once I have them.

It's been a week and a half with no answer, so I think I'm probably okay to post my own answer at this point.
In response to my question 1): no, I have not found any way to do this with the built-in MATLAB functions.
In response to my question 2), yes, what I have posted will work. However, it was a little more involved than I imagined due to matlab file permissions. Here is a step-by-step guide:
i) Somewhere in your MATLAB path, create the directory @arima_Custom.
ii) In the command window, type edit arima. Copy the text of this file into a new M-file and save it in the directory @arima_Custom with the filename arima_Custom.m.
iii) Locate the econometrics toolbox on your machine. Once found, look for the directory @arima in the toolbox. This directory will probably be located (on a Linux machine) at something like $MATLAB_ROOT/toolbox/econ/econ/@arima (on my machine, $MATLAB_ROOT is /usr/local/Matlab/R2012b). Copy the contents of @arima to @arima_Custom, except do NOT copy the file arima.m.
iv) Open arima_Custom for editing, ie edit arima_Custom. In this file change line 1 from:
classdef (Sealed) arima < internal.econ.LagIndexableTimeSeries
to
classdef (Sealed) arima_Custom < internal.econ.LagIndexableTimeSeries
Next, change line 406 from:
function OBJ = arima(varargin)
to
function OBJ = arima_Custom(varargin)
Now, change line 993 from:
if isa(OBJ.Variance, 'double') && (OBJ.Variance <= 0)
to
if isa(OBJ.Variance, 'double') && (sum(OBJ.Variance <= 0) > 0)
v) Open the simulate.m located in @arima_Custom for editing (we copied it there in step iii). It is probably best to open this file by navigating to it manually in the Current Folder window, to ensure the correct simulate.m is opened. In this file, alter line 310 from:
E(:,(maxPQ + 1:end)) = Z * sqrt(variance);
to
%Check that the input variance is of the right length (if it isn't scalar)
if ~isscalar(variance)
    if size(variance, 2) ~= 1
        error('Deterministic variance must be a column vector');
    end
    if size(variance, 1) ~= numObs
        error('Deterministic variance vector is incorrect length relative to number of observations');
    end
else
    variance = variance(ones(numObs, 1));
end
%Scale innovations using deterministic variance
E(:,(maxPQ + 1:end)) = sqrt(ones(numPaths, 1) * variance') .* Z;
And we're done!
You should now be able to simulate with deterministically time-varying variance using the arima_Custom class, for example (for an ARIMA(0,1,0)):
ARIMAModel = arima_Custom('D', 1, 'Variance', ScalarVariance, 'Constant', 0);
ARIMAModel.Variance = TimeVaryingVarianceVector;
[X, e, VarianceVector] = simulate(ARIMAModel, NumObs, 'numPaths', NumPaths);
Further, you should also still be able to use MATLAB's original arima class, since we didn't alter it.
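As a final sanity check (my own snippet; I'm assuming e comes back NumObs-by-NumPaths, as with the stock simulate), the cross-path sample variance of the innovations should track the input path:
NumObs = 200; NumPaths = 10000;
TimeVaryingVarianceVector = linspace(0.5, 2, NumObs)'; %user-chosen variance path
ARIMAModel = arima_Custom('D', 1, 'Variance', 1, 'Constant', 0);
ARIMAModel.Variance = TimeVaryingVarianceVector;
[X, e] = simulate(ARIMAModel, NumObs, 'numPaths', NumPaths);
plot(var(e, 0, 2)); hold on; %per-time-step sample variance across paths
plot(TimeVaryingVarianceVector, 'k--'); %should lie close to the sample variance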

Related

Can't interpret SPSS Error Message in MATRIX code

A program to run a Schmid-Leiman transformation using SPSS's Matrix language was published in 2005 by Woolf & Preising in Behavior Research Methods, volume 37, pages 48 to 58. It is probably not important for you to know what a Schmid-Leiman transformation is, but I'll explain in comments if you feel it is necessary.
In modifying the program for my own data, I'm getting an error I can't figure out:
Error # 12302 in column 12. Text: ,
Syntax error.
Execution of this command stops.
Error in RIGHT HAND SIDE of COMPUTE command.
The MATRIX statement skipped.
Here is the beginning of the code. The error is showing as coming in Line 6:
* Encoding: UTF-8.
* Schmid-Leiman Solution for 2 level higher-order Factor analysis.
Matrix.
* ENTER YOUR SPECIFICATIONS HERE.
* Enter first-order pattern matrix.
Compute F1={.461, .253, -.058, -.069;
.241, .600, .143, .033;
.582, .047, -.077, -.125;
.327, .297, -.120, -.166;
.176, .448, -.240, -.099;
.680, .069, -.036, -.138;
.415, .228, -.091, -.153;
.
.
.
.390, .205, .002, -.098;
.164, .369, -.170, -.047
}.
As shown above, the text generating the error is shown as a comma (,), but the actual text (following the COMPUTE statement) in column 12 is an open bracket ({). So I have no idea what is going on. Can someone help?
For reference, the original code as proposed by Woolf & Preising (2005) is found here.
The Woolf & Preising article is found here.
PS: The sample program given in the link above does run on my copy of SPSS. Here's the beginning of that code:
* Schmid-Leiman Solution for 2 level higher-order Factor analysis.
Matrix.
* ENTER YOUR SPECIFICATIONS HERE.
* Enter first-order pattern matrix.
Compute F1={0.099, 0.5647, -0.1521;
0.0124, 0.9419, -0.1535;
-0.1501, 0.6177, 0.4218;
0.7441, -0.0882, 0.1425;
0.6241, 0.2793, -0.1137;
0.8693, -0.0331, 0.0289;
-0.0154, -0.2706, 0.6262;
-0.0914, 0.0995, 0.7216;
0.1502, 0.0835, 0.398}.

Performing an "online" linear interpolation

I have a problem where I need to do a linear interpolation on some data as it is acquired from a sensor (it's technically position data, but the nature of the data doesn't really matter). I'm doing this now in matlab, but since I will eventually migrate this code to other languages, I want to keep the code as simple as possible and not use any complicated matlab-specific/built-in functions.
My implementation initially seems OK, but when checking my work against matlab's built-in interp1 function, it seems my implementation isn't perfect, and I have no idea why. Below is the code I'm using on a dataset already fully collected, but as I loop through the data, I act as if I only have the current sample and the previous sample, which mirrors the problem I will eventually face.
%make some dummy data
np = 109; %number of data points for x and y
x_data = linspace(3,98,np) + (normrnd(0.4,0.2,[1,np]));
y_data = normrnd(2.5, 1.5, [1,np]);
%define the query points the data will be interpolated over
qp = [1:100];
kk=2; %indexes through the data
cc = 1; %indexes through the query points
qpi = qp(cc); %qpi is the current query point in the loop
y_interp = qp*nan; %this will hold our solution
while kk<=length(x_data)
kk = kk+1; %update the data counter
%perform online interpolation
if cc<length(qp)-1
if qpi>=y_data(kk-1) %the query point, of course, has to be in-between the current value and the next value of x_data
y_interp(cc) = myInterp(x_data(kk-1), x_data(kk), y_data(kk-1), y_data(kk), qpi);
end
if qpi>x_data(kk), %if the current query point is already larger than the current sample, update the sample
kk = kk+1;
else %otherwise, update the query point to ensure its in between the samples for the next iteration
cc = cc + 1;
qpi = qp(cc);
%It is possible that if the change in x_data is greater than the resolution of the query
%points, an update like the above wont work. In this case, we must lag the data
if qpi<x_data(kk),
kk=kk-1;
end
end
end
end
%get the correct interpolation
y_interp_correct = interp1(x_data, y_data, qp);
%plot both solutions to show the difference
figure;
plot(y_interp,'displayname','manual-solution'); hold on;
plot(y_interp_correct,'k--','displayname','matlab solution');
leg1 = legend('show');
set(leg1,'Location','Best');
ylabel('interpolated points');
xlabel('query points');
Note that the "myInterp" function is as follows:
function yi = myInterp(x1, x2, y1, y2, qp)
%linearly interpolate the function value y(x) over the query point qp
yi = y1 + (qp-x1) * ( (y2-y1)/(x2-x1) );
end
And here is the plot showing that my implementation isn't correct :-(
Can anyone help me find where the mistake is? And why? I suspect it has something to do with ensuring that the query point is in-between the previous and current x-samples, but I'm not sure.
The problem in your code is that you at times call myInterp with a value of qpi that is outside of the bounds x_data(kk-1) and x_data(kk). This leads to invalid extrapolation results.
Your logic of looping over kk rather than cc is very confusing to me. I would write a simple for loop over cc, which are the points at which you want to interpolate. For each of these points, advance kk, if necessary, such that qp(cc) is in between x_data(kk) and x_data(kk+1) (you can use kk-1 and kk instead if you prefer, just initialize kk=2 to ensure that kk-1 exists, I just find starting at kk=1 more intuitive).
To simplify the logic here, I'm limiting the values in qp to be inside the limits of x_data, so that we don't need to test that x_data(kk+1) exists, nor that x_data(1) < qp(cc). You can add those tests in if you wish.
Here's my code:
qp = [ceil(x_data(1)+0.1):floor(x_data(end)-0.1)];
y_interp = qp*nan; % this will hold our solution
kk=1; % indexes through the data
for cc=1:numel(qp)
% advance kk to where we can interpolate
% (this loop is guaranteed to not index out of bounds because x_data(end)>qp(end),
% but needs to be adjusted if this is not ensured prior to the loop)
while x_data(kk+1) < qp(cc)
kk = kk + 1;
end
% perform online interpolation
y_interp(cc) = myInterp(x_data(kk), x_data(kk+1), y_data(kk), y_data(kk+1), qp(cc));
end
As you can see, the logic is a lot simpler this way. The result is identical to y_interp_correct. The inner while x_data... loop serves the same purpose as your outer while loop, and would be the place where you read your data from wherever it's coming from.
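If you want a truly online version (my sketch, under the same assumptions: qp and x_data sorted ascending, and qp inside the range of x_data), keep only the previous sample and emit every query point that falls inside each newly completed interval:
y_interp = nan(size(qp)); % holds the solution
cc = 1; % next query point to emit
x_prev = NaN; y_prev = NaN; % last sample seen (none yet)
for kk = 1:numel(x_data) % stand-in for "a new sample arrives"
x_new = x_data(kk); y_new = y_data(kk);
if ~isnan(x_prev)
while cc <= numel(qp) && qp(cc) <= x_new
y_interp(cc) = myInterp(x_prev, x_new, y_prev, y_new, qp(cc));
cc = cc + 1;
end
end
x_prev = x_new; y_prev = y_new;
end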

The tensor product ti() in GAM package gives incorrect results

I am surprised to notice that it is somehow difficult to obtain a correct fit of an interaction function from gam().
To be more specific, I want to estimate an additive model:
y = m_1(x) + m_2(z) + m_{12}(x,z) + u,
where m_1(x) = x^2, m_2(z) = z^2, and m_{12}(x,z) = xz. The following code generates this model:
test1 <- function(x,z,sx=1,sz=1) {
#--m1(x) function
m.x<-x^2
m.x<-m.x-mean(m.x)
#--m2(z) function
m.z<-z^2
m.z<-m.z-mean(m.z)
#--m12(x,z) function
m.xz<-x*z
m.xz<-m.xz-mean(m.xz)
m<-m.x+m.z+m.xz
return(list(m=m,m.x=m.x,m.z=m.z,m.xz=m.xz))
}
n <- 1000
a=0
b=2
x <- runif(n,a,b)/20
z <- runif(n,a,b)
u <- rnorm(n,0,0.5)
model<-test1(x,z)
y <- model$m + u
So I use gam() by fitting the model as
b3 <- gam(y~ ti(x) + ti(z) + ti(x,z))
vis.gam(b3);title("tensor anova")
#---extracting basis matrix
B.f3<-model.matrix.gam(b3)
#---extracting series estimator
b3.hat<-b3$coefficients
Question: when I plot the estimated functions from gam() above against the true functions, I end up with the following:
par(mfrow=c(1,3))
#---m1(x)
B.x<-B.f3[,c(2:5)]
b.x.hat<-b3.hat[c(2:5)]
plot(x,B.x%*%b.x.hat)
points(x,model$m.x,col='red')
legend('topleft',c('Estimate','True'),lty=c(1,1),col=c('black','red'))
#---m2(z)
B.z<-B.f3[,c(6:9)]
b.z.hat<-b3.hat[c(6:9)]
plot(z,B.z%*%b.z.hat)
points(z,model$m.z,col='red')
legend('topleft',c('Estimate','True'),lty=c(1,1),col=c('black','red'))
#---m12(x,z)
B.xz<-B.f3[,-c(1:9)]
b.xz.hat<-b3.hat[-c(1:9)]
plot(x,B.xz%*%b.xz.hat)
points(x,model$m.xz,col='red')
legend('topleft',c('Estimate','True'),lty=c(1,1),col=c('black','red'))
However, the estimate of m_1(x) is very different from x^2, and the interaction estimate of m_{12}(x,z) is likewise very different from the xz defined in test1 above. The results are the same if I use predict(b3).
I really can't figure it out. Can anybody help me out by explaining why the results come out this way? Greatly appreciated!
First, the problem above is not due to the package, of course. It is closely related to the identification conditions of the smooth functions. One common practice is to impose the assumptions that E(m_j(.)) = 0 for each individual function j = 1, ..., d, and that E(m_ij(x_i, x_j) | x_i) = E(m_ij(x_i, x_j) | x_j) = 0 for i not equal to j. Those conditions require one to employ centered basis functions in the series estimator, which is already done in the GAM package. However, in my case above, the function m(x,z) = x*z defined in test1 does not satisfy these identification assumptions, since the integral of x*z with respect to either x or z is not zero when x and z range from zero to two.
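To see concretely why x*z fails these conditions (my own algebra, assuming x and z independent): the product decomposes as
x*z = E[x]E[z] + E[z](x - E[x]) + E[x](z - E[z]) + (x - E[x])(z - E[z]),
and only the last term satisfies E(m_12(x,z)|x) = E(m_12(x,z)|z) = 0. So the centered product (x - E[x])(z - E[z]) is what ti(x,z) can identify, while the E[z]x and E[x]z pieces are absorbed into the main effects and the constant, which is exactly why the plotted estimates differ from x^2 and xz.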
Furthermore, the series estimator allows the individual and interaction functions to be identified if one instead imposes m(0) = 0 or m(0, x_j) = m(x_i, 0) = 0. This can readily be achieved by centering the basis functions around zero. I have tried both cases, and they work well whenever the DGP satisfies the identification conditions.

Why does Octave output $ g = [... ...] $

When I run this code (in a programming assignment for Coursera):
J = 1/m * [-y.*log(sigmoid((theta)'*X))-(1-y).*log(1-sigmoid((theta)'*X))]
where m = length(y), y is an m-dimensional vector, X is an m*2 matrix, and theta = 0.1, Octave outputs:
g =
[long (#rows)*2 matrix, each entry <1 but extremely close to 1]
g =
[another long (#rows)*2 matrix as before]
J =
[(#rows)*2 matrix with entries such as 3.4932e-002 and 7.8914e-005]
What is g? I never defined it, and it does not appear in my code, yet it is output alongside some seemingly unrelated numbers. (I know that the function itself may have problems, but that is a separate issue from the one I'm interested in here. I figured that if I knew what g was, I might be able to troubleshoot better. If you have any comments on the function, please don't hesitate to point out what's wrong.)
Whenever you have a statement (inside a function or otherwise) which is not terminated with a semicolon, the output of that statement will display on the terminal.
Assuming that this is the only code you're running, then my guess is that inside your sigmoid function there is a statement of this kind:
g = dosomething() % note: not semicolon terminated!
resulting in terminal output during its execution.
The fact that g is reported twice in the terminal also makes sense, since you are calling the sigmoid function twice in that expression you just wrote.
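For example, a sigmoid written like this (hypothetical, but it would produce exactly the symptom you describe) echoes g on every call:
function g = sigmoid(z)
g = 1 ./ (1 + exp(-z)) % missing semicolon: Octave prints "g = ..." on every call
end
Terminating that line with a semicolon will silence the output.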
Also, for the sake of clarity, please do not refer to your one-liner as a function, since that means something entirely different in the context of programming.

Can a SHA-1 hash be all-zeroes?

Is there any input that SHA-1 will compute to a hex value of forty zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely: one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
Also, because SHA-1 is cryptographically strong, it would be computationally infeasible (at least with current computer technology; all bets are off for emergent technologies such as quantum computing) to find input data that results in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel, be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: your code will be permanently broken from the moment it does occur.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input), you could use the SHA-1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Another option is to use the hash of some string outside your input domain, such as sha1('a'), if, say, your inputs are numeric-only by invariant. If the input is preprocessed to add any regular decoration, then a hash of something without the decoration would work as well (e.g., sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
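For instance, the assertion suggested above might look like this in MATLAB (a sketch only; it uses the standard java.security.MessageDigest for the hashing, and the input is a stand-in):
inputData = 'some input'; %stand-in for your real input
md = java.security.MessageDigest.getInstance('SHA-1');
digest = typecast(md.digest(uint8(inputData)), 'uint8'); %20-byte SHA-1 digest
assert(any(digest ~= 0), 'Input data hashed to the all-zero sentinel!');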
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0x98BADCFE & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
so (a leftrotate 5) + f + e + k = 0x9fb498b3, and hence w[0] = 0 - 0x9fb498b3 = 0x604b674d (mod 2^32), etc. This value is then used in words 16, 19, 22, 24-25, 27-28, and 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at least 32 other input values.
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
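To make the wrap-around concrete (a toy check of the mod-2^32 arithmetic only, nothing SHA-1-specific; shown in MATLAB/Octave syntax):
h0 = hex2dec('67452301'); %SHA-1's initial h0, used only as an example value
a = mod(2^32 - h0, 2^32); %the unique a with h0 + a == 0 (mod 2^32)
disp(mod(h0 + a, 2^32)) %prints 0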
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People don't seem to realize just how improbable this is (if all of humanity focused all of its current resources on finding a zero hash by brute-forcing, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. Yet you assume you can't walk through a solid wall, even though the probability of doing so is merely insanely low, not zero.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.
