Determine location of optimal hydrogen bond donor/acceptor pair? - biopython

I have a PDB structure that I'm analyzing which has a putative binding pocket in it for some endogenous ligand.
I'd like to to determine for the relevant amino acids within, say, 3A of the pocket, where the optimal hydrogen bond donor/acceptor pair for the ligand would be within the pocket. I have done this already for determining locations of optimal pi-pi stacking (e.g. find aromatic residues, determine plane the face of the ring, go out N angstroms orthogonal to the face), but i'm struggling to consider this for hydrogen bonds.
Can this be done?

Weel, I'll try to write out how I would try to do it.
First of all its not clear to me if your pocket is described by a grid that represent the pocket surface, or by a grid that represent all the pocket space (lets call it pocket cloud).
With Biopython assuming you have a cloud described by your grid:
Loop over all the cloud-grid points:
for every point loop over all the PDB atoms that are H donor or acceptor:
if the distance is in the desidered target range (3A - distance for optimal
donor or acceptor pair):
select the corresponding AA/atom/point
add to your result list the point as donor/acceptor/or both togeher
with the atom/AA selected
else:
pass
with Biopyton and distances see here: Biopython PDB: calculate distance between an atom and a point
H bonds are generally 2.7 to 3.3 Å
I am not sure my logic is correct, the idea is to end up with a subset of your grids point where you have red grid points where you could pose a donor and blue ones where you could pose an acceptor.
We are talking only about distances here, if you introduce geometry factors of the bond I think you should need a ligand with its own geometry too
Of course with this approach you would waste a lot of time on not productive computation, if You find a way to select only the grid surface point you could select a subset of PDB atoms that are close to the surface (3A) and then use the same approach above.

Related

PETSc vectorize operations with neighboring vector values

I'm implementing finite difference algorithm from uFDTD book. Many FDM equations involve operations on adjoined vector elements.
For example, an update equation for electric field
ez[m] = ez[m] + (hy[m] - hy[m-1]) * imp0
uses adjoined vector values hy[m] and hy[m-1].
How can I implement these operations in PETSc efficiently? Is there something beyond local vector loops and scatterers?
If my goal was efficiency, I would call a stencil engine. There are many many many papers, and sometimes even open source code, for example, Devito. The idea is that PETSc manages the data structure and parallelism. Then you can feed the local data brick to your favorite stencil computer.

Measure distance by RSSI in veins4.4 Omnet++5 SUMO0.25

I am a master student working with localization in VANEts
in this moment I am working on a trilateration method based on RSSI for
Cooperative Positioning (CP).
I am considering the Analogue Model : Simple Path Loss Model
But I have some doubts in how to calculate the distance correctly for a determined Phy Model.
I spent some time (one day) reading some papers of Dr. Sommer about the PHY models included in veins.
Would anyone help-me with this solution?
I need a way to:
1) Measure the power of an receiver when its receive a beacon (I found this in the Decider class).
In the Decider802.11p the received Power can be obtained with this line in method Decider80211p::processSignalEnd(AirFrame* msg):
double recvPower_dBm = 10*log10(signal.getReceivingPower()->getValue(start));
2) Apply a formula of RSSI accordingly the phy model in order to achieve a distance estimation between transmiter and receiver.
3) Asssociate this measure (distance by RSSI) with the Wave Short Message to be delivered in AppLayer of the receiver (that is measuring the RSSI).
After read the paper "On the Applicability of Two-Ray Path Loss Models for Vehicular Network Simulation"
and the paper "A Computationally Inexpensive Empirical Model of IEEE 802.11p Radio Shadowing in Urban Environments"
and investigating how it works in the veins project. I noticed that each analogue model have your own path loss model
with your own variables to describe the model.
For example for the SimplePathLossModel we have these
variables defined on AnalogueModels folder of veins modules:
lambda = 0.051 m (wave length to IEEE 802.11p CCH center frequency of 5.890 GHz)
A constant alpha = 2 (default value used)
a distance factor given by pow(sqrDistance, -pathLossAlphaHalf) / (16.0 * M_PI * M_PI);
I found one formula for indoor environments in this link, but I am in doubt if it is applicable for vehicular environments.
Any clarification is welcome. Thanks a lot.
Technically, you are correct. Indeed, you could generate a simple look-up table: have one vehicle drive past another one, record distance and RSSIs, and you have a table that can map RSSI to mean distance (without knowing how the TX power, antenna gains, path loss model, fading models, etc, are configured).
In the simplest case, if you assume that antennas are omnidirectional, that path loss follows the Friis transmission equation, that no shadow fading occurs, and that fast fading is negligible, your table will be perfect.
In a more complicated case, where your simulation also includes probabilistic fast fading (say, a Nakagami model), shadow fading due to radio obstacles (buildings), etc. your table will still be roughly correct, but less so.
It is important to consider a real-life application, though. Consider if your algorithm still works if conditions change (more reflective road surface changing reflection parameters, buildings blocking more or less power, antennas with non-ideal or even unknown gain characteristics, etc).

Non-linear interaction terms in Stata

I have a continuous dependent variable polity_diff and a continuous primary independent variable nb_eq. I have hypothesized that the effect of nb_eq will vary with different levels of the continuous variable gini_round in a non-linear manner: The effect of nb_eq will be greatest for mid-range values of gini_round and close to 0 for both low and high levels of gini_round (functional shape as a second-order polynomial).
My question is: How this is modelled in Stata?
To this point I've tried with a categorized version of gini_round which allows me to compare the different groups, but obviously this doesn't use data to its fullest. I can't get my head around the inclusion of a single interaction term which allows me to test my hypothesis. My best bet so far is something along the lines of the following (which is simplified by excluding some if-arguments etc.):
xtreg polity_diff c.nb_eq##c.gini_round_squared, fe vce(cluster countryno),
but I have close to 0 confidence that this is even nearly right.
Here's how I might do it:
sysuse auto, clear
reg price c.weight#(c.mpg##c.mpg) i.foreign
margins, dydx(weight) at(mpg = (10(10)40))
marginsplot
margins, dydx(weight) at(mpg=(10(10)40)) contrast(atcontrast(ar(2(1)4)._at) wald)
We interact weight with a second degree polynomial of mpg. The first margins calculates the average marginal effect of weight at different values of mpg. The graph looks like what you describe. The second margins compares the slopes at adjacent values of mpg and does a joint test that they are all equal.
I would probably give weight its own effect as well (two octothorpes rather than one), but the graph does not come out like your example:
reg price c.weight##(c.mpg##c.mpg) i.foreign

Kohonen Self Organizing Maps: Determining the number of neurons and grid size

I have a large dataset I am trying to do cluster analysis on using SOM. The dataset is HUGE (~ billions of records) and I am not sure what should be the number of neurons and the SOM grid size to start with. Any pointers to some material that talks about estimating the number of neurons and grid size would be greatly appreciated.
Thanks!
Quoting from the som_make function documentation of the som toolbox
It uses a heuristic formula of 'munits = 5*dlen^0.54321'. The
'mapsize' argument influences the final number of map units: a 'big'
map has x4 the default number of map units and a 'small' map has
x0.25 the default number of map units.
dlen is the number of records in your dataset
You can also read about the classic WEBSOM which addresses the issue of large datasets
http://www.cs.indiana.edu/~bmarkine/oral/self-organization-of-a.pdf
http://websom.hut.fi/websom/doc/ps/Lagus04Infosci.pdf
Keep in mind that the map size is also a parameter which is also application specific. Namely it depends on what you want to do with the generated clusters. Large maps produce a large number of small but "compact" clusters (records assigned to each cluster are quite similar). Small maps produce less but more generilized clusters. A "right number of clusters" doesn't exists, especially in real world datasets. It all depends on the detail which you want to examine your dataset.
I have written a function that, with the data set as input, returns the grid size. I rewrote it from the som_topol_struct() function of Matlab's Self Organizing Maps Toolbox into a R function.
topology=function(data)
{
#Determina, para lattice hexagonal, el número de neuronas (munits) y su disposición (msize)
D=data
# munits: número de hexágonos
# dlen: número de sujetos
dlen=dim(data)[1]
dim=dim(data)[2]
munits=ceiling(5*dlen^0.5) # Formula Heurística matlab
#munits=100
#size=c(round(sqrt(munits)),round(munits/(round(sqrt(munits)))))
A=matrix(Inf,nrow=dim,ncol=dim)
for (i in 1:dim)
{
D[,i]=D[,i]-mean(D[is.finite(D[,i]),i])
}
for (i in 1:dim){
for (j in i:dim){
c=D[,i]*D[,j]
c=c[is.finite(c)];
A[i,j]=sum(c)/length(c)
A[j,i]=A[i,j]
}
}
VS=eigen(A)
eigval=sort(VS$values)
if (eigval[length(eigval)]==0 | eigval[length(eigval)-1]*munits<eigval[length(eigval)]){
ratio=1
}else{
ratio=sqrt(eigval[length(eigval)]/eigval[length(eigval)-1])}
size1=min(munits,round(sqrt(munits/ratio*sqrt(0.75))))
size2=round(munits/size1)
return(list(munits=munits,msize=sort(c(size1,size2),decreasing=TRUE)))
}
hope it helps...
Iván Vallés-Pérez
I don't have a reference for it, but I would suggest starting off by using approximately 10 SOM neurons per expected class in your dataset. For example, if you think your dataset consists of 8 separate components, go for a map with 9x9 neurons. This is completely just a ballpark heuristic though.
If you'd like the data to drive the topology of your SOM a bit more directly, try one of the SOM variants that change topology during training:
Growing SOM
Growing Neural Gas
Unfortunately these algorithms involve even more parameter tuning than plain SOM, but they might work for your application.
Kohenon has written on the issue of selecting parameters and map size for SOM in his book "MATLAB Implementations and Applications of the Self-Organizing Map". In some cases, he suggest the initial values can be arrived at after testing several sizes of the SOM to check that the cluster structures were shown with sufficient resolution and statistical accuracy.
my suggestion would be the following
SOM is distantly related to correspondence analysis. In statistics, they use 5*r^2 as a rule of thumb, where r is the number of rows/columns in a square setup
usually, one should use some criterion that is based on the data itself, meaning that you need some criterion for estimating the homogeneity. If a certain threshold would be violated, you would need more nodes. For checking the homogeneity you would need some records per node. Agai, from statistics you could learn that for simple tests (small number of variables) you would need around 20 records, for more advanced tests on some variables at least 8 records.
remember that the SOM represents a predictive model. So validation is the key, absolutely mandatory. Yet, validation of predictive models (see typeI / II error entry in Wiki) is a subject on its own. And the acceptable risk as well as the risk structure also depend fully on your purpose.
You may test the dynamics of the error rate of the model by reducing its size more and more. Then take the smallest one with acceptable error.
It is a strength of the SOM to allow for empty nodes. Yet, there should not be too much of them. Let me say, less than 5%.
Taken all together, from experience, I would recommend the following criterion a minimum of the absolute number of 8..10 records, but those should not be more than 5% of all clusters.
Those 5% rule is of of course a heuristics, which however can be justified by the general usage of the confidence level in statistical tests. You may choose any percentage from 1% to 5%.

Accelerate framework ios: Fastest pythagoren calculation

So I have 2 matrices: each is 100x100.
I am looking to calculate a 3rd matrix such that: M3[i]=sqrt(M1[i]^2 + M2[i]^2).
I can obviously do ForLoops but I am sure there is something faster.
I digged into the accelerate framework and got lost in Linpack world
Any help to get me on the right track...
Thanks
The Accelerate framework is a good idea.
You could use a function like vDSP_vsq to operate on one column of the matrix at a time, placing the result in the corresponding column of M3. You might have to make two resultant matrices (one which will be M1^2, the other M2^2), and then add them to form the correct M3 result matrix by adding the columns with a call to vDSP_vadd which, again, will be able to operate one column at a time.
There is sample code (showing how to add two vectors, etc) here at the Apple developer page.
I think the fastest way is to use
vDSP_vpythg
Vector Pythagoras; single precision.
Subtracts vector C from A and squares the differences, subtracts vector D from B and squares the differences, adds the two sets of squared differences, and then writes the square roots of the sums to vector E.
Obviously pass ZERO vector for C and D .

Resources