So I have 2 matrices: each is 100x100.
I am looking to calculate a 3rd matrix such that: M3[i]=sqrt(M1[i]^2 + M2[i]^2).
I can obviously do ForLoops but I am sure there is something faster.
I digged into the accelerate framework and got lost in Linpack world
Any help to get me on the right track...
Thanks
The Accelerate framework is a good idea.
You could use a function like vDSP_vsq to operate on one column of the matrix at a time, placing the result in the corresponding column of M3. You might have to make two resultant matrices (one which will be M1^2, the other M2^2), and then add them to form the correct M3 result matrix by adding the columns with a call to vDSP_vadd which, again, will be able to operate one column at a time.
There is sample code (showing how to add two vectors, etc) here at the Apple developer page.
I think the fastest way is to use
vDSP_vpythg
Vector Pythagoras; single precision.
Subtracts vector C from A and squares the differences, subtracts vector D from B and squares the differences, adds the two sets of squared differences, and then writes the square roots of the sums to vector E.
Obviously pass ZERO vector for C and D .
Related
I have a PDB structure that I'm analyzing which has a putative binding pocket in it for some endogenous ligand.
I'd like to to determine for the relevant amino acids within, say, 3A of the pocket, where the optimal hydrogen bond donor/acceptor pair for the ligand would be within the pocket. I have done this already for determining locations of optimal pi-pi stacking (e.g. find aromatic residues, determine plane the face of the ring, go out N angstroms orthogonal to the face), but i'm struggling to consider this for hydrogen bonds.
Can this be done?
Weel, I'll try to write out how I would try to do it.
First of all its not clear to me if your pocket is described by a grid that represent the pocket surface, or by a grid that represent all the pocket space (lets call it pocket cloud).
With Biopython assuming you have a cloud described by your grid:
Loop over all the cloud-grid points:
for every point loop over all the PDB atoms that are H donor or acceptor:
if the distance is in the desidered target range (3A - distance for optimal
donor or acceptor pair):
select the corresponding AA/atom/point
add to your result list the point as donor/acceptor/or both togeher
with the atom/AA selected
else:
pass
with Biopyton and distances see here: Biopython PDB: calculate distance between an atom and a point
H bonds are generally 2.7 to 3.3 Å
I am not sure my logic is correct, the idea is to end up with a subset of your grids point where you have red grid points where you could pose a donor and blue ones where you could pose an acceptor.
We are talking only about distances here, if you introduce geometry factors of the bond I think you should need a ligand with its own geometry too
Of course with this approach you would waste a lot of time on not productive computation, if You find a way to select only the grid surface point you could select a subset of PDB atoms that are close to the surface (3A) and then use the same approach above.
I have a data set with 20 non-overlapping different swap rates (spot1y, 1y1y, 2y1y, 3y1y, 4y1y, 5y2y, 7y3y, 10y2y, 12y3y...) over the past year.
I want to use PCA / multiregression and look at residuals in order to determine which sectors on the curve are cheap/rich. Has anyone had experience with this? I've done PCA but not for time series. I'd ideally like to model something similar to the first figure here but in USD.
https://plus.credit-suisse.com/rpc4/ravDocView?docid=kv66a7
Thanks!
Here are some broad strokes that can help answer your question. Also, that's a neat analysis from CS :)
Let's be pythonistas and use NumPy. You can imagine your dataset as a 20x261 array of floats. The first place to start is creating the array. Suppose you have a CSV file storing the raw data persistently. Then a reasonable first step to load the data would be something as simple as:
import numpy
x = numpy.loadtxt("path/to/my/file")
The object x is our raw time series matrix, and we verify the truthness of x.shape == (20, 261). The next step is to transform this array into it's covariance matrix. Whether it has been done on the raw data already, or it still has to be done, the first step is centering each time series on it's mean, like this:
x_centered = x - x.mean(axis=1, keepdims=True)
The purpose of this step is to help simplify any necessary rescaling, and is a very good habit that usually shouldn't be skipped. The call to x.mean uses the parameters axis and keepdims to make sure each row (e.g. the time series for spot1yr, ...) is centered with it's mean value.
The next steps are to square and scale x to produce a swap rate covariance array. With 2-dimensional arrays like x, there are two ways to square it-- one that leads to a 261x261 array and another that leads to a 20x20 array. It's the second array we are interested in, and the squaring procedure that will work for our purposes is:
x_centered_squared = numpy.matmul(x_centered, x_centered.transpose())
Then, to scale one can chose between 1/261 or 1/(261-1) depending on the statistical context, which looks like this:
x_covariance = x_centered_squared * (1/261)
The array x_covariance has an entry for how each swap rate changes with itself, and changes with any one of the other swap rates. In linear-algebraic terms, it is a symmetric operator that characterizes the spread of each swap rate.
Linear algebra also tells us that this array can be decomposed into it's associated eigen-spectrum, with elements in this spectrum being scalar-vector pairs, or eigenvalue-eigenvector pairs. In the analysis you shared, x_covariance's eigenvalues are plotted in exhibit two as percent variance explained. To produce the data for a plot like exhibit two (which you will always want to furnish to the readers of your PCA), you simply divide each eigenvalue by the sum of all of them, then multiply each by 100.0. Due to the convenient properties of x_covariance, a suitable way to compute it's spectrum is like this:
vals, vects = numpy.linalg.eig(x_covariance)
We are now in a position to talk about residuals! Here is their definition (with our namespace): residuals_ij = x_ij − reconstructed_ij; i = 1:20; j = 1:261. Thus for every datum in x, there is a corresponding residual, and to find them, we need to recover the reconstructed_ij array. We can do this column-by-column, operating on each x_i with a change of basis operator to produce each reconstructed_i, each of which can be viewed as coordinates in a proper subspace of the original or raw basis. The analysis describes a modified Gram-Schmidt approach to compute the change of basis operator we need, which ensures this proper subspace's basis is an orthogonal set.
What we are going to do in the approach is take the eigenvectors corresponding to the three largest eigenvalues, and transform them into three mutually orthogonal vectors, x, y, z. Research the web for active discussions and questions geared toward developing the Gram-Schmidt process for all sorts of practical applications, but for simplicity let's follow the analysis by hand:
x = vects[0] - sum([])
xx = numpy.dot(x, x)
y = vects[1] - sum(
(numpy.dot(x, vects[1]) / xx) * x
)
yy = numpy.dot(y, y)
z = vects[2] - sum(
(numpy.dot(x, vects[2]) / xx) * x,
(numpy.dot(y, vects[2]) / yy) * y
)
It's reasonable to implement normalization before or after this step, which should be informed by the data of course.
Now with the raw data, we implicitly made the assumption that the basis is standard, we need a map between {e1, e2, ..., e20} and {x,y,z}, which is given by
ch_of_basis = numpy.array([x,y,z]).transpose()
This can be used to compute each reconstructed_i, like this:
reconstructed = []
for measurement in x.transpose().tolist():
reconstructed.append(numpy.dot(ch_of_basis, measurement))
reconstructed = numpy.array(reconstructed).transpose()
And then you get the residuals by subtraction:
residuals = x - reconstructed
This flow obviously might need further tuning, but it's the gist of how to do compute all the residuals. To get that periodic bar plot, take the average of each row in residuals.
I need to recognize images with hand-written numerals with known values. Physical objects with the number are always identical but come in slight variations of positions/scale/lighting. They are about 100 in number, having about 100x500 px in size.
In the first pass, the code should "learn" possible inputs, and then recognize them (classify them as being close to one of the "training" images) when they come again.
I was mostly following the Feature Matching Python-OpenCV tutorial
Input images are analyzed first, keypoints & descriptors are remembered in the orbTrained list:
import cv2
import collections
ORBTrained=collections.namedtuple('ORBTrained',['kp','des','img'])
orbTrained=[]
for img in trainingImgs:
z2=preprocessImg(img)
orb=cv2.ORB_create(nfeatures=400,patchSize=30,edgeThreshold=0)
kp,des=orb.detectAndCompute(z2,None)
orbTrained.append(ORBTrained(kp=kp,des=des,img=z2))
z3=cv2.drawKeypoints(z2,kp,None,color=(0,255,0),flags=0)
A typical result of this first stage looks like this:
Then in the next loop, for each real input image, cycle through all training images to see which is matching the best:
ORBMatch=collections.namedtuple('ORBMatch',['dist','match','train'])
for img in inputImgs:
z2=preprocessNum(img)
orb=cv2.ORB_create(nfeatures=400,patchSize=30,edgeThreshold=0)
kp,des=orb.detectAndCompute(z2,None)
bf=cv2.BFMatcher(cv2.NORM_HAMMING,crossCheck=True)
mm=[]
for train in orbTrained:
m=bf.match(des,train.des)
dist=sum([m_.distance for m_ in m])
mm.append(ORBMatch(dist=dist,match=m,train=train))
# sort matching images based on score
mm.sort(key=lambda m: m.dist)
print([m.dist for m in mm[:5]])
best=mm[0]
best.match.sort(key=lambda x:x.distance) # sort matches in the best match
z3=cv2.drawMatches(z2,kp,best.train.img,best.train.kp,best.match[:50],None,flags=2)
The result I get is nonsensical, and consistently so (only when I run with pixel-identical input, the result is correct):
What is the problem? Am I completely misunderstanding what to do, or do I just need to tune some parameters?
First, are you sure you're not reinventing the wheel by creating your own OCR library? There are many free frameworks, some of which support training with custom character sets.
Second, you should understand what feature matching is. It will find similar small areas, but isn't aware of other feature pairs. It will match similar corners of characters, not the characters itself. You might experiment with larger patchSize so that it covers at least half of the digit.
You can minimize false pairs by running feature detection only on a single digit at a time using thresholding and contours to find character bounds.
If the text isn't rotated, using roation-invariant feature descriptor such as ORB isn't the best option, try rotation-variant descriptor, such as FAST.
According to authors of papers (ORB and ORB-SLAM) ORB is invariant to rotation and scale "in a certain range". May be you should first match for small scale or rotation change.
I have a continuous dependent variable polity_diff and a continuous primary independent variable nb_eq. I have hypothesized that the effect of nb_eq will vary with different levels of the continuous variable gini_round in a non-linear manner: The effect of nb_eq will be greatest for mid-range values of gini_round and close to 0 for both low and high levels of gini_round (functional shape as a second-order polynomial).
My question is: How this is modelled in Stata?
To this point I've tried with a categorized version of gini_round which allows me to compare the different groups, but obviously this doesn't use data to its fullest. I can't get my head around the inclusion of a single interaction term which allows me to test my hypothesis. My best bet so far is something along the lines of the following (which is simplified by excluding some if-arguments etc.):
xtreg polity_diff c.nb_eq##c.gini_round_squared, fe vce(cluster countryno),
but I have close to 0 confidence that this is even nearly right.
Here's how I might do it:
sysuse auto, clear
reg price c.weight#(c.mpg##c.mpg) i.foreign
margins, dydx(weight) at(mpg = (10(10)40))
marginsplot
margins, dydx(weight) at(mpg=(10(10)40)) contrast(atcontrast(ar(2(1)4)._at) wald)
We interact weight with a second degree polynomial of mpg. The first margins calculates the average marginal effect of weight at different values of mpg. The graph looks like what you describe. The second margins compares the slopes at adjacent values of mpg and does a joint test that they are all equal.
I would probably give weight its own effect as well (two octothorpes rather than one), but the graph does not come out like your example:
reg price c.weight##(c.mpg##c.mpg) i.foreign
Say, I have a signal represented as an array of real numbers y = [1,2,0,4,5,6,7,90,5,6]. I can use Daubechies-4 coefficients D4 = [0.482962, 0.836516, 0.224143, -0.129409], and apply a wavelet transform to receive high- and low-frequencies of the signal. So, the high frequency component will be calculated like this:
high[v] = y[2*v]*D4[0] + y[2*v+1]*D4[1] + y[2*v+2]*D4[2] + y[2*v+3]*D4[3],
and the low frequency component can be calculated using other D4 coefs permutation.
The question is: what if y is complex array? Do I just multiply and add complex numbers to receive subbands, or is it correct to get amplitude and phase, treat each of them like a real number, do the wavelet transform for them, and then restore complex number array for each subband using formulas real_part = abs * cos(phase) and imaginary_part = abs * sin(phase)?
To handle the case of complex data, you're looking at the Complex Wavelet Transform. It's actually a simple extension to the DWT. The most common way to handle complex data is to treat the real and imaginary components as two separate signals and perform a DWT on each component separately. You will then receive the decomposition of the real and imaginary components.
This is commonly known as the Dual-Tree Complex Wavelet Transform. This can best be described by the figure below that I pulled from Wikipedia:
Source: Wikipedia
It's called "dual-tree" because you have two DWT decompositions happening in parallel - one for the real component and one for the imaginary. In the above diagram, g0/h0 represent the low-pass and high-pass components of the real part of the signal x and g1/h1 represent the low-pass and high-pass components of the imaginary part of the signal x.
Once you decompose the real and imaginary parts into their respective DWT decompositions, you can combine them to get the magnitude and/or phase and proceed to the next step or whatever you desire to do with them.
The mathematical proof regarding the correctness of this is outside the scope of what we're talking about, but if you would like to see how this got derived, I refer you to the canonical paper by Kingsbury in 1997 in the work Image Processing with Complex Wavelets - http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=835E60EAF8B1BE4DB34C77FEE9BBBD56?doi=10.1.1.55.3189&rep=rep1&type=pdf. Pay close attention to the noise filtering of images using the CWT - this is probably what you're looking for.