Can I use spatstat to measure spatial aggregation of the environment?

Currently I have individual raster data that represent the suitable environment for 18 species based on MAXENT predictions. I would like to know whether the suitable environment for each species is aggregated or not. I know that the R package spatstat is usually used to test aggregation of a spatial point pattern, but it seems that I can't test that for the environment itself. Is that actually the case? Would any of you be able to direct me to a package that I could use to test aggregation of the environment?
Thanks in advance!!
To follow up the question above with more details and figures:
I have attached two images that I hope make my question clearer. I would like to quantify whether the green cells (suitable environment) in figure A are more aggregated than the cells (suitable environment) in figure B. Green cells have a value of 1 and the white space around them has a value of 0. I do not want to use the point locations of individuals, since I am not trying to test whether the individual points are aggregated. What I was doing was using the X and Y coordinates of each green cell, but when I calculate the Clark-Evans index it indicates no aggregation for either map. I think this is because, when I use the X and Y coordinates of the green cells for the Clark-Evans test, they all form one continuous pattern, as in the third figure. I hope this extra information helps, because I think I have hit a wall.
[Figures: A, potential aggregated environment; B, potential non-aggregated environment; C, green-cell X and Y coordinates used for the Clark-Evans test.]

You can use spatstat to estimate the covariance function of each environment, treating each environment as a random set. Suppose G is a window (class "owin") representing the green cells; R is another window representing the red cells; and W is the containing window in which the environments are observed. To estimate the covariance function of the green cells you could do
cW <- setcov(W)
pG <- area(G)/area(W)
cG <- setcov(G)/(pG * cW)
cG[cW == 0] <- NA
fG <- rotmean(cG)
Then pG is the coverage fraction and fG is the (isotropic) covariance function. You could now do the same thing for R instead of G, and compare the two plots. Higher values of the covariance suggest a more aggregated environment.
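In case it helps to see the whole workflow in one place, here is a minimal sketch starting from binary suitability maps. The image names Zg and Zr, and the use of solutionset() to turn the 0/1 maps into windows, are my own assumptions rather than part of the recipe above.
library(spatstat)
# Assumed inputs: Zg and Zr are pixel images ("im" objects) with value 1 on
# suitable cells and 0 elsewhere, both covering the same study region.
G <- solutionset(Zg == 1)     # window of suitable cells, map A
R <- solutionset(Zr == 1)     # window of suitable cells, map B
W <- as.owin(Zg)              # containing window (the whole study region)
covfun <- function(X) {       # the recipe above, wrapped for reuse
  cW <- setcov(W)
  pX <- area(X) / area(W)
  cX <- setcov(X) / (pX * cW)
  cX[cW == 0] <- NA
  rotmean(cX)
}
fG <- covfun(G)
fR <- covfun(R)
# Side-by-side plots of the isotropic covariance functions against distance r;
# consistently higher values indicate a more aggregated environment.
par(mfrow = c(1, 2))
plot(fG, main = "Map A")
plot(fR, main = "Map B")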

Related

Trying to do PCA analysis on interest rate swaps data (multivariate time series)

I have a data set with 20 non-overlapping different swap rates (spot1y, 1y1y, 2y1y, 3y1y, 4y1y, 5y2y, 7y3y, 10y2y, 12y3y...) over the past year.
I want to use PCA / multiple regression and look at residuals in order to determine which sectors of the curve are cheap/rich. Has anyone had experience with this? I've done PCA before, but not for time series. I'd ideally like to model something similar to the first figure here, but in USD.
https://plus.credit-suisse.com/rpc4/ravDocView?docid=kv66a7
Thanks!
Here are some broad strokes that can help answer your question. Also, that's a neat analysis from CS :)
Let's be pythonistas and use NumPy. You can imagine your dataset as a 20x261 array of floats. The first place to start is creating the array. Suppose you have a CSV file storing the raw data persistently. Then a reasonable first step to load the data would be something as simple as:
import numpy
x = numpy.loadtxt("path/to/my/file", delimiter=",")  # comma-separated values
The object x is our raw time series matrix, and we verify that x.shape == (20, 261). The next step is to transform this array into its covariance matrix. Whether or not this has already been done to the raw data, the first step is centering each time series on its mean, like this:
x_centered = x - x.mean(axis=1, keepdims=True)
The purpose of this step is to help simplify any necessary rescaling, and it is a very good habit that usually shouldn't be skipped. The call to x.mean uses the parameters axis and keepdims to make sure each row (e.g. the time series for spot1y, ...) is centered on its mean value.
The next steps are to square and scale x to produce a swap rate covariance array. With 2-dimensional arrays like x, there are two ways to square it: one leads to a 261x261 array and the other to a 20x20 array. It's the second array we are interested in, and the squaring procedure that works for our purposes is:
x_centered_squared = numpy.matmul(x_centered, x_centered.transpose())
Then, to scale, one can choose between 1/261 or 1/(261-1) depending on the statistical context, which looks like this:
x_covariance = x_centered_squared * (1/261)
The array x_covariance has an entry for how each swap rate changes with itself, and changes with any one of the other swap rates. In linear-algebraic terms, it is a symmetric operator that characterizes the spread of each swap rate.
Linear algebra also tells us that this array can be decomposed into its associated eigen-spectrum, with elements in this spectrum being scalar-vector pairs, or eigenvalue-eigenvector pairs. In the analysis you shared, x_covariance's eigenvalues are plotted in exhibit two as percent variance explained. To produce the data for a plot like exhibit two (which you will always want to furnish to the readers of your PCA), you simply divide each eigenvalue by the sum of all of them, then multiply each by 100.0. Due to the convenient properties of x_covariance, a suitable way to compute its spectrum is like this:
# x_covariance is symmetric, so eigh applies: real eigenvalues in ascending order, eigenvectors as the columns of vects
vals, vects = numpy.linalg.eigh(x_covariance)
We are now in a position to talk about residuals! Here is their definition (with our namespace): residuals_ij = x_ij − reconstructed_ij; i = 1:20; j = 1:261. Thus for every datum in x there is a corresponding residual, and to find them we need to recover the reconstructed_ij array. We can do this column-by-column: a change-of-basis operator gives each column x_i its coordinates in a proper subspace of the original (raw) basis, and mapping those coordinates back produces reconstructed_i. The analysis describes a modified Gram-Schmidt approach to compute the change-of-basis operator we need, which ensures this proper subspace's basis is an orthogonal set.
What we are going to do in this approach is take the eigenvectors corresponding to the three largest eigenvalues and transform them into three mutually orthogonal vectors u, v, w (avoiding the names x, y, z, since x is already our data matrix). There are plenty of discussions online about implementing the Gram-Schmidt process for all sorts of practical applications, but for simplicity let's follow the analysis by hand:
# eigh sorts the eigenvalues in ascending order, so the eigenvectors for the
# three largest eigenvalues are the last three columns of vects
u = vects[:, -1]
uu = numpy.dot(u, u)
v = vects[:, -2] - (numpy.dot(u, vects[:, -2]) / uu) * u
vv = numpy.dot(v, v)
w = (vects[:, -3]
     - (numpy.dot(u, vects[:, -3]) / uu) * u
     - (numpy.dot(v, vects[:, -3]) / vv) * v)
It's reasonable to normalize these vectors to unit length before or after this step; the reconstruction below assumes unit-length basis vectors.
With the raw data we implicitly assumed the standard basis, so we need a map between {e1, e2, ..., e20} and {u, v, w}, which is given by
ch_of_basis = numpy.array([u, v, w]).transpose()        # 20x3, one basis vector per column
ch_of_basis /= numpy.linalg.norm(ch_of_basis, axis=0)   # normalize each column to unit length
This can be used to compute each reconstructed_i, like this:
reconstructed = []
for measurement in x_centered.transpose().tolist():
    coords = numpy.dot(ch_of_basis.transpose(), measurement)  # coordinates in the {u, v, w} subspace
    reconstructed.append(numpy.dot(ch_of_basis, coords))      # mapped back into the raw 20-dimensional basis
reconstructed = numpy.array(reconstructed).transpose()
reconstructed += x.mean(axis=1, keepdims=True)  # add the row means back so it is on the same scale as x
And then you get the residuals by subtraction:
residuals = x - reconstructed
This flow obviously might need further tuning, but it's the gist of how to compute all the residuals. To get that periodic bar plot, take the average of each row in residuals.

Time series clustering of activity of machines

I have a NxM matrix where N is the number of time intervals and M are the number of nodes in a graph.
Each cell indicates whether that node was active in that time interval.
Now I need to find groups of nodes that always appear together across the time series. Is there an approach I can use to cluster these nodes based on their time-series activity?
In R you could do this:
# hierarchical clustering
library(dendextend) # contains color_branches()
dist_ts <- dist(t(mydata)) # distances between nodes (the columns of your N x M matrix become the rows of t(mydata))
hc_dist <- hclust(dist_ts)
dend_ts <- as.dendrogram(hc_dist)
# set some value for h (height within the dendrogram) here that makes sense for you
dend_100 <- color_branches(dend_ts, h = 100)
plot(dend_100)
This creates a dendrogram with colored branches.
You could do much better visualizations, but your post is pretty generic (somewhat unclear what you're asking) and you didn't indicate whether you like R at all.
As the sets may overlap, most clustering methods will not produce optimal results.
Instead, treat each time point as a transaction, containing all active nodes as items. Then run frequent itemset mining to find frequently active sets of machines.
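As a hedged illustration of that idea in R: the matrix name, the toy data and the 10% support threshold below are placeholders, and the itemset mining itself comes from the arules package.
library(arules)   # frequent itemset mining (eclat / apriori)
# Stand-in for your N x M activity matrix: rows = time intervals,
# columns = nodes, 1 = node active in that interval.
activity <- matrix(rbinom(200 * 30, 1, 0.2), nrow = 200, ncol = 30,
                   dimnames = list(NULL, paste0("node", 1:30)))
# Each time interval becomes one transaction whose items are the active nodes.
trans <- as(activity == 1, "transactions")
# Frequent itemsets = groups of nodes that are active together in at least
# 10% of the intervals (tune the support threshold to your data).
sets <- eclat(trans, parameter = list(supp = 0.10, minlen = 2))
inspect(head(sort(sets, by = "support"), 10))
Unlike a hard clustering, the itemsets found this way are allowed to overlap, which matches the situation described above.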

Non-linear interaction terms in Stata

I have a continuous dependent variable polity_diff and a continuous primary independent variable nb_eq. I have hypothesized that the effect of nb_eq will vary with different levels of the continuous variable gini_round in a non-linear manner: The effect of nb_eq will be greatest for mid-range values of gini_round and close to 0 for both low and high levels of gini_round (functional shape as a second-order polynomial).
My question is: how is this modelled in Stata?
So far I've tried a categorized version of gini_round, which allows me to compare the different groups, but obviously this doesn't use the data to its fullest. I can't get my head around how to include a single interaction term that lets me test my hypothesis. My best bet so far is something along the lines of the following (simplified by excluding some if-conditions etc.):
xtreg polity_diff c.nb_eq##c.gini_round_squared, fe vce(cluster countryno)
but I have close to 0 confidence that this is even nearly right.
Here's how I might do it:
sysuse auto, clear
reg price c.weight#(c.mpg##c.mpg) i.foreign
margins, dydx(weight) at(mpg = (10(10)40))
marginsplot
margins, dydx(weight) at(mpg=(10(10)40)) contrast(atcontrast(ar(2(1)4)._at) wald)
We interact weight with a second degree polynomial of mpg. The first margins calculates the average marginal effect of weight at different values of mpg. The graph looks like what you describe. The second margins compares the slopes at adjacent values of mpg and does a joint test that they are all equal.
I would probably give weight its own effect as well (two octothorpes rather than one), but the graph does not come out like your example:
reg price c.weight##(c.mpg##c.mpg) i.foreign

Compare Plots in matlab

I have two plots in MATLAB in which I have plotted x and y coordinates. Given these two plots, is it possible to check whether they match? Can I obtain numbers that tell how well they match?
Note that the graphs could be shifted right/left/up/down in the plot (turning the axis off is not a problem), or scaled/rotated (I would also like to know if a graph is skewed, but for now that is not a must).
It does not need to test color elements, color inversion, or any graphic properties more complicated than the basic ones mentioned above.
If MATLAB is not enough, I would welcome other tools.
Note that I cannot simply take the absolute difference of the x and y values. I could average the absolute differences in x and in y separately, but I need a combined error measure; I need to compare the graphs as a whole.
[Figure: the graphs to be compared]
EDIT
Direct correlation does not work for me.
For a different set of data I got a correlation of .94, which is very high for that data, even though one series fluctuates less and faster than the other.
You can access the plotted data with this code
x = 10:100;
y = log10(x);
plot(x,y);
h = gcf;
axesObjs = get(h, 'Children'); %axes handles
dataObjs = get(axesObjs, 'Children'); %handles to low-level graphics objects in axes
objTypes = get(dataObjs, 'Type'); %type of low-level graphics object
xdata = get(dataObjs, 'XData'); %data from low-level graphics objects
ydata = get(dataObjs, 'YData');
Then you can do a correlation between xdata and ydata for example, or any kind of comparison. The correlation coefficient R gives a measure of how well they match.
[R,P] = corrcoef(xdata, ydata);
You might also be interested in comparing the axis limits of the current axes. For example
R = ( diff(get(h_ax1,'XLim')) / diff(get(h_ax2,'XLim')) ) + ...
( diff(get(h_ax1,'YLim')) / diff(get(h_ax2,'YLim')) )
where h_ax1 is the handle of the first axes and h_ax2 of the second. Here you get a comparison between the (XLim + YLim) values. The possible comparisons with different gca properties are vast, though.
EDIT
To compare two sets of points, you may use metrics other than an analytical relationship. I am thinking of distances or convergences such as the Hausdorff distance. A script is available on MATLAB Central. I have used such a distance to compare letter shapes. On the Wikipedia page, the 'Applications' section is of interest (edge detection for thick shapes, though it may not be pertinent to your particular problem).
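Since other tools were welcomed in the question, here is a minimal sketch of the symmetric Hausdorff distance written in R rather than MATLAB; the function name and the toy curves are my own illustration.
# Symmetric Hausdorff distance between two point sets A and B,
# given as n x 2 and m x 2 matrices of (x, y) coordinates.
hausdorff <- function(A, B) {
  # pairwise Euclidean distances between every point of A and every point of B
  D <- sqrt(outer(A[, 1], B[, 1], "-")^2 + outer(A[, 2], B[, 2], "-")^2)
  max(apply(D, 1, min),   # largest distance from a point of A to its nearest point in B
      apply(D, 2, min))   # largest distance from a point of B to its nearest point in A
}
# Toy usage: a curve and a vertically shifted copy of it.
theta <- seq(0, 2 * pi, length.out = 200)
A <- cbind(theta, sin(theta))
B <- cbind(theta, sin(theta) + 0.5)
hausdorff(A, B)   # roughly 0.5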

setting nls parameters for a curve

I am a relative newcomer to R and not a mathematician but a geneticist. I have many sets of multiple pairs of data points. When they are plotted they yield a flattened S curve, with most of the data points ending up near the zero mark and a minority flying far off, creating what is almost two J curves, one down and one up. I need to find the inflection points where the data sharply veers upward or downward. This may be an issue with my math, but it seems to me that if I can smooth the data, fit a curve to the line and get an equation, I could then take the second derivative of the curve and determine the inflection points from where the second derivative changes sign. I tried it in Excel and used the curve to get an approximate fit for the starting formula, but the data has a bit of "wiggling" in it, so determining any one inflection point is not possible even if I wanted to do it all manually (which I don't). Each of the hundreds of data sets in which I have to find these two inflection points will yield about the same curve but slightly different inflection points, and determining those inflection points precisely is absolutely critical to the problem. So if I can set it up properly once in an equation, that should do it. For simplicity I would like to break them into the positive curve and the negative curve and do each one separately. (Maybe there is some easier formula for S curves that makes that a bad idea?)
I have tried reading the manual and it's kind of hard to understand likely because of my weak math skills. I have also been unable to find any similar examples I could study from.
This is the head of my data set:
x y
[1,] 1 0.00000000
[2,] 2 0.00062360
[3,] 3 0.00079720
[4,] 4 0.00085100
[5,] 5 0.00129020
(X is just numbering 1 to however many data points and the number of X will vary a bit by the individual set.)
This is as far as I have gotten to resolve the curve fitting part:
pos_curve1 <- nls(curve_fitting ~ (scal*x^scal), data = cbind.data.frame(curve_fitting),
                  start = list(x = 0, scal = -0.01))
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
Am I just doing the math the hard way? What am I doing wrong with the nls? Any help would be much much appreciated.
Found it. The curve is exponential, not a J curve, and the following worked.
fit <- nls(pos ~ a*tmin^b,
           data = d,
           start = list(a = .1, b = .1),
           trace = TRUE)
Thanks due to Jorge I Velez at R Help Oct 26, 2009
Also I used "An Appendix to An R Companion to Applied Regression, second edition" by John Fox & Sanford Weisberg, last revised 13 Dec 2010.
Final working settings for me were:
fit <- nls(y ~ a*log(10)^(x*b), data = pos_curve2, start = list(a = .01, b = .01), trace = TRUE)
I figured out what the formula should be by using an OpenOffice spreadsheet and testing the various curve fit options until I was able to show that exponential was the best fit. I then got the structure of the equation from that. I used the Fox & Weisberg appendix to understand how to set the parameters.
Maybe I am not alone in this one but I really found it hard to figure out the parameters and there were few references or questions on it that helped me.
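If it helps to sanity-check the result, here is a small sketch of overlaying the fitted curve on the data; it assumes, as in the call above, that pos_curve2 is a data frame with columns x and y and that fit is the nls object.
# Overlay the fitted curve on the raw data to check the fit visually.
x_grid <- seq(min(pos_curve2$x), max(pos_curve2$x), length.out = 500)
y_hat  <- predict(fit, newdata = data.frame(x = x_grid))
plot(pos_curve2$x, pos_curve2$y, pch = 16, cex = 0.6,
     xlab = "x", ylab = "y", main = "Data with fitted exponential curve")
lines(x_grid, y_hat, col = "red", lwd = 2)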

Resources