Rolling mean for a row from column x onwards - mean

I have a large data set with multiple rows and columns. I want to calculate the rolling mean for each cell in a row from column x onwards. Similarly, I want to calculate the standard deviation as well. How can I do it? I also want to do this for all the rows.

In what programming language, and with what kind of file?
The general math for a running (cumulative) mean would be
total = value in column x
count = 1
for element in columns x+1 to y
    total = total + element
    count = count + 1
    average = total / count
Standard deviation is more involved, and you'd be better off using a library for it.
For Python: use NumPy (or SciPy, which builds on it)
For JavaScript: the built-in Math object (Math.sqrt) is enough to compute it by hand
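As a minimal sketch of the idea above in Python with NumPy, assuming "rolling" means a cumulative (expanding) mean rather than a fixed-size window, and assuming the data fits in a 2-D array. The function name `expanding_mean_std` and the sample values are made up for illustration. It returns the population standard deviation; the sum-of-squares shortcut used here can lose precision when values are large relative to their spread.

```python
import numpy as np

def expanding_mean_std(data, start):
    """Cumulative mean and population std for every row, from column `start` onwards."""
    block = np.asarray(data, dtype=float)[:, start:]
    counts = np.arange(1, block.shape[1] + 1)            # 1, 2, 3, ... elements seen so far
    running_mean = np.cumsum(block, axis=1) / counts
    running_mean_sq = np.cumsum(block ** 2, axis=1) / counts
    # Var = E[v^2] - E[v]^2; clip tiny negatives caused by rounding
    variance = np.maximum(running_mean_sq - running_mean ** 2, 0.0)
    return running_mean, np.sqrt(variance)

# Hypothetical data: two rows, statistics start at column index 1
data = [[1.0, 2.0, 3.0, 4.0],
        [10.0, 20.0, 30.0, 40.0]]
means, stds = expanding_mean_std(data, start=1)
```

Each output cell at position k holds the statistic over columns `start` through `start + k` of that row, so all rows are handled in one call.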

Related

Transforming Features to increase similarity

I have a large dataset (~20,000 samples x 2,000 features-- each sample w/ a corresponding y-value) that I'm constructing a regression ML model for.
The input vectors are bitvectors with either 1s or 0s at each position.
Interestingly, I have noticed that when I 'randomly' select N samples such that their y-values are between two arbitrary values A and B (such that B-A is much smaller than the total range of values in y), the subsequent model is much better at predicting other values with the A-->B range not used in the training of the model.
However, the input X vectors for these samples are in no way more similar to each other than any random selection of X vectors across the whole dataset.
Is there an available method to transform the input X-vectors such that those with more similar y-values are "closer" (I'm not particular about the methodology, but it could be something like cosine similarity), and those with dissimilar y-values are separated?
After more thought, I believe this question can be re-framed as a supervised clustering problem. What might be able to accomplish this might be as simple as:
import umap
print(df.shape)
# (23312, 2149)
print(len(target))
# 23312
embedding = umap.UMAP().fit_transform(df, y=target)

Standard Deviation of 2 Datasets (Each Having Standard Deviation)

So say I had 2 datasets (each dataset is a set of values and each has a standard deviation).
I want to find the mean difference between the two datasets elementwise e.g. ((element1_set1 - element1_set2) + (element2_set1 - element2_set2)) / 2 for two datasets of length 2.
Does this mean that I have to add the standard deviations elementwise and then find the mean of these to get the overall standard deviation?
Or do I just find the mean and std of the array, [element1_set1 - element1_set2, element2_set1 - element2_set2]?
I don't really get why you mix in the standard deviation there.
For getting the mean difference, you can just subtract the means.
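That works because of the following (assuming x_i are the elements of the first dataset and y_i the elements of the second, each of length n):
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i) = \frac{1}{n}\sum_{i=1}^{n}x_i - \frac{1}{n}\sum_{i=1}^{n}y_i = \bar{x} - \bar{y}$$
But that doesn't work with the standard deviation because of the squares: the spread of the differences also depends on how the two datasets vary together. So for the standard deviation, compute it directly from the array of elementwise differences.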
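As a quick numerical check of the point about means versus standard deviations, here is a small sketch with NumPy. The two arrays are made-up example data, and the sample standard deviation (`ddof=1`) is an assumption; the same conclusion holds for the population version.

```python
import numpy as np

# Hypothetical paired datasets of equal length
set1 = np.array([5.0, 7.0, 9.0, 4.0])
set2 = np.array([3.0, 8.0, 6.0, 1.0])

diff = set1 - set2                                  # elementwise differences

# Means: the mean of the differences equals the difference of the means
mean_of_diff = diff.mean()
diff_of_means = set1.mean() - set2.mean()

# Standard deviations: averaging the two stds does NOT give the std of the differences
std_of_diff = diff.std(ddof=1)
mean_of_stds = (set1.std(ddof=1) + set2.std(ddof=1)) / 2

# What does hold: Var(x - y) = Var(x) + Var(y) - 2 Cov(x, y)
covariance = np.cov(set1, set2)[0, 1]               # np.cov uses ddof=1 by default
var_identity = set1.var(ddof=1) + set2.var(ddof=1) - 2 * covariance

print(mean_of_diff, diff_of_means)
print(std_of_diff, mean_of_stds, np.sqrt(var_identity))
```

So the second option in the question, taking the mean and std of the array of differences, is the one that gives consistent results.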

Trying to do PCA analysis on interest rate swaps data (multivariate time series)

I have a data set with 20 non-overlapping different swap rates (spot1y, 1y1y, 2y1y, 3y1y, 4y1y, 5y2y, 7y3y, 10y2y, 12y3y...) over the past year.
I want to use PCA / multiregression and look at residuals in order to determine which sectors on the curve are cheap/rich. Has anyone had experience with this? I've done PCA but not for time series. I'd ideally like to model something similar to the first figure here but in USD.
https://plus.credit-suisse.com/rpc4/ravDocView?docid=kv66a7
Thanks!
Here are some broad strokes that can help answer your question. Also, that's a neat analysis from CS :)
Let's be pythonistas and use NumPy. You can imagine your dataset as a 20x261 array of floats. The first place to start is creating the array. Suppose you have a CSV file storing the raw data persistently. Then a reasonable first step to load the data would be something as simple as:
import numpy
x = numpy.loadtxt("path/to/my/file", delimiter=",")  # comma delimiter, since the file is a CSV
The object x is our raw time series matrix, and we check that x.shape == (20, 261) holds. The next step is to transform this array into its covariance matrix. Whether it has been done on the raw data already, or it still has to be done, the first step is centering each time series on its mean, like this:
x_centered = x - x.mean(axis=1, keepdims=True)
The purpose of this step is to help simplify any necessary rescaling, and it is a very good habit that usually shouldn't be skipped. The call to x.mean uses the parameters axis and keepdims to make sure each row (e.g. the time series for spot1yr, ...) is centered with its mean value.
The next steps are to square and scale x to produce a swap rate covariance array. With 2-dimensional arrays like x, there are two ways to square it-- one that leads to a 261x261 array and another that leads to a 20x20 array. It's the second array we are interested in, and the squaring procedure that will work for our purposes is:
x_centered_squared = numpy.matmul(x_centered, x_centered.transpose())
Then, to scale, one can choose between 1/261 or 1/(261-1) depending on the statistical context (population versus sample covariance), which looks like this:
x_covariance = x_centered_squared * (1/261)
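As a sanity check on the centering, squaring and scaling steps, the result can be compared against NumPy's own covariance routine. This is a sketch on random stand-in data of the same shape, not real swap rates; `numpy.cov` with `bias=True` uses the 1/N scaling, which matches the 1/261 choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 261))          # stand-in for the 20 x 261 swap-rate matrix

# Manual route: center each row, square, scale by 1/N
x_centered = x - x.mean(axis=1, keepdims=True)
x_covariance = np.matmul(x_centered, x_centered.T) / x.shape[1]

# Library route: rows are variables by default, bias=True selects the 1/N scaling
x_covariance_check = np.cov(x, bias=True)

print(x_covariance.shape)
print(np.allclose(x_covariance, x_covariance_check))
```

Dropping `bias=True` gives the 1/(N-1) variant instead.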
The array x_covariance has an entry for how each swap rate changes with itself, and changes with any one of the other swap rates. In linear-algebraic terms, it is a symmetric operator that characterizes the spread of each swap rate.
Linear algebra also tells us that this array can be decomposed into its associated eigen-spectrum, with elements in this spectrum being scalar-vector pairs, or eigenvalue-eigenvector pairs. In the analysis you shared, x_covariance's eigenvalues are plotted in exhibit two as percent variance explained. To produce the data for a plot like exhibit two (which you will always want to furnish to the readers of your PCA), you simply divide each eigenvalue by the sum of all of them, then multiply each by 100.0. Because x_covariance is symmetric, a suitable way to compute its spectrum is like this:
vals, vects = numpy.linalg.eigh(x_covariance)  # eigenvalues in ascending order, eigenvectors in the columns of vects
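The percent-variance-explained calculation described above could be sketched as follows. The helper name `percent_variance_explained` is made up for illustration, and the input is random stand-in data rather than real swap rates; `numpy.linalg.eigvalsh` is used because only the eigenvalues of a symmetric matrix are needed here.

```python
import numpy as np

def percent_variance_explained(covariance):
    """Share of total variance carried by each principal component, largest first."""
    vals = np.linalg.eigvalsh(covariance)   # real eigenvalues of a symmetric matrix, ascending
    vals = vals[::-1]                       # reorder so the largest comes first
    return 100.0 * vals / vals.sum()

rng = np.random.default_rng(1)
data = rng.normal(size=(20, 261))           # stand-in for the swap-rate matrix
percent = percent_variance_explained(np.cov(data, bias=True))
print(percent[:3])                          # shares of the three leading components
```

On real curve data the first few entries typically dominate, which is what justifies keeping only three components later on.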
We are now in a position to talk about residuals! Here is their definition (with our namespace): residuals_ij = x_ij − reconstructed_ij; i = 1:20; j = 1:261. Thus for every datum in x, there is a corresponding residual, and to find them, we need to recover the reconstructed_ij array. We can do this column-by-column, operating on each x_i with a change of basis operator to produce each reconstructed_i, each of which can be viewed as coordinates in a proper subspace of the original or raw basis. The analysis describes a modified Gram-Schmidt approach to compute the change of basis operator we need, which ensures this proper subspace's basis is an orthogonal set.
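What we are going to do in this approach is take the eigenvectors corresponding to the three largest eigenvalues, and transform them into three mutually orthogonal vectors, u1, u2, u3 (named so that they do not overwrite the data array x). NumPy returns the eigenvectors as the columns of vects, not necessarily with the largest eigenvalue first, so order them by eigenvalue before picking three. Search the web for active discussions and questions geared toward developing the Gram-Schmidt process for all sorts of practical applications, but for simplicity let's follow the analysis by hand:
order = numpy.argsort(vals)[::-1]  # indices of the eigenvalues, largest first
v1, v2, v3 = (vects[:, i] for i in order[:3])  # eigenvectors are columns of vects
u1 = v1
u2 = v2 - (numpy.dot(u1, v2) / numpy.dot(u1, u1)) * u1
u3 = (v3
      - (numpy.dot(u1, v3) / numpy.dot(u1, u1)) * u1
      - (numpy.dot(u2, v3) / numpy.dot(u2, u2)) * u2)
For a symmetric matrix such as x_covariance, eigenvectors belonging to distinct eigenvalues are already orthogonal, so this step mainly guards against repeated eigenvalues and rounding error.
It's reasonable to implement normalization before or after this step, which should be informed by the data of course. The projection below needs unit-length basis vectors, so normalize them here:
u1, u2, u3 = (u / numpy.linalg.norm(u) for u in (u1, u2, u3))
Now with the raw data, we implicitly made the assumption that the basis is standard, so we need a map between {e1, e2, ..., e20} and {u1, u2, u3}, which is given by
ch_of_basis = numpy.array([u1, u2, u3]).transpose()  # shape (20, 3)
This can be used to compute every reconstructed_i at once: ch_of_basis.transpose() gives the coordinates of each measurement in the new basis, and ch_of_basis maps those coordinates back into the original 20 dimensions:
coords = numpy.matmul(ch_of_basis.transpose(), x)  # shape (3, 261)
reconstructed = numpy.matmul(ch_of_basis, coords)  # shape (20, 261)
And then you get the residuals by subtraction:
residuals = x - reconstructed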
This flow obviously might need further tuning, but it's the gist of how to compute all the residuals. To get that periodic bar plot, take the average of each row in residuals.

2.3 ratio between Pytorch BCEloss and my own "log" calculations

I'm scripting a toy model to both practice PyTorch and GAN models, and I'm making sure I understand each step as much as possible.
That led me to check my understanding of the BCELoss function, and apparently I understand it... with a ratio of 2.3.
To check the results, I write the intermediate values for Excel:
tmp1 = y_pred.tolist() # predicted values in list (to copy/paste on Excel)
tmploss = nn.BCELoss(reduction='none') # redefining a loss giving the whole BCEloss tensor
tmp2 = tmploss(y_pred, y_real).tolist() # BCEloss values in list (to copy/paste on Excel)
Then I copy tmp1 into Excel and calculate -log(x) for each value, which is the BCELoss formula for y_target = y_real = 1.
Then I compare the resulting values with the values of tmp2: these values are 2.3x higher than "mine".
(Sorry, I couldn't figure out how to format tables on this site...)
Can you please tell me what is happening? I feel a PEBCAK coming :-)
This is because in Excel the LOG function, when called without a base, calculates the logarithm to base 10.
The standard definition of binary cross entropy uses the natural logarithm (base e), which is LN in Excel.
The ratio you're seeing is just ln(10) ≈ 2.302585.
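The base mismatch can be reproduced without PyTorch or Excel. This is a small sketch using only Python's `math` module; the function names and the sample probabilities are made up for illustration, and `excel_style_log_loss` simply mimics a base-10 logarithm.

```python
import math

def bce_for_target_one(p):
    """Binary cross entropy for a target of 1: natural logarithm."""
    return -math.log(p)

def excel_style_log_loss(p):
    """Same formula, but with a base-10 logarithm."""
    return -math.log10(p)

predictions = [0.9, 0.5, 0.1]   # hypothetical predicted probabilities
ratios = [bce_for_target_one(p) / excel_style_log_loss(p) for p in predictions]
print(ratios)                   # every entry equals math.log(10)
```

The ratio is the same constant for every prediction because ln(p) = log10(p) * ln(10).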

Predict future values using highcharts/Highstock

I need to predict the future values based on given set of data. I found in the following link a method of obtaining trend line moving average.
http://www.highcharts.com/plugin-registry/single/16/technical-indicators
jsfiddle is here http://jsfiddle.net/laff/WaEBc/
But my requirement is to predict future values based on this moving average.
I searched a lot, but couldn't find anything. Please help.
Thanks!
If you need to predict, you have to calculate the predicted points yourself. It's not built in.
To find the equation to produce a trend line, search for Linear Regression.
You will need to calculate the slope and intercept using the linear regression calculations, and you build your trend line using those two values, combined with an x value for the start and end points that are defined by the min and max x values of the data set.
(i.e. your first point is {x: min x value, y: intercept + (slope * min x value)} and your second point is {x: max x value, y: intercept + (slope * max x value)})
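The slope and intercept calculation described above could be sketched as follows. It is written in Python for illustration rather than as Highcharts code, uses ordinary least squares, and the function names and sample points are hypothetical. The two returned points are the fitted line evaluated at the smallest and largest x in the data.

```python
def fit_trend_line(xs, ys):
    """Ordinary least-squares slope and intercept for paired x/y values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # undefined if all x values are equal
    intercept = mean_y - slope * mean_x
    return slope, intercept

def trend_line_endpoints(xs, ys):
    """Two points that draw the trend line across the range of the data."""
    slope, intercept = fit_trend_line(xs, ys)
    x_min, x_max = min(xs), max(xs)
    return [
        {"x": x_min, "y": intercept + slope * x_min},
        {"x": x_max, "y": intercept + slope * x_max},
    ]

# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_trend_line(xs, ys)
endpoints = trend_line_endpoints(xs, ys)
```

The endpoints deliberately stay within the observed x range, in line with using the trend line to describe the data rather than to extrapolate from it.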
Much more importantly:
Trend lines do NOT predict future values that fall outside of the existing range of the independent variable in the data.
Using regression to plot a line in this way will help you build a predictive model of what your dependent variable may be when given a known independent variable.
It will absolutely not give you a reliable prediction of what will happen to Y as X increases beyond the scope of the known data, especially when X is a time value.
Building an actual predictive model of values over time is much more involved, and there isn't one single way to do it. It depends on what factors affect those values, and what data you have to demonstrate those effects.
some reference:
Predictive modelling

Resources