Changing Kademlia Metric - Unidirectional Property Importance - dht

Kademlia uses XOR metric. Among other things, this has so called "unidirectional" property (= for any given point x and distance e>0, there is exactly one point y such that d(x,y)=e).
First question is a general question: Is this property of the metric critical for the functionality of Kademlia, or is it just the thing that helps with revealing pressure from certain nodes (as the original paper suggests). In other words, if we want to change the metric, how important is to come with a metric that is "unidirectional" as well?
Second question is about concrete change of the metric: Let's assume we have node identifiers (addresses) as X-bit numbers, would any of the following metric work with Kademlia?
d(x,y) = abs(x-y)
d(x,y) = abs(x-y) + 1/(x xor y)
The first metric simply provides difference between numbers, so for node ID 100 the nodes with IDs 90 and 110 are equally distant, so this is not unidirectional metric. In the second case we fix that adding 1/(x xor y), where we know that (x xor y) is unidirectional, so having 1/(x xor y) should preserve this property.
Thus for node ID 100, the node ID 90 is d(100,90) = 10 + 1/62, while the distance from node ID 110 is d(100,110) = 10 + 1/10.

You wouldn't be dealing with kademlia anymore. There are man other routing algorithms which use different distance metrics, some even non-uniform distance metrics, but they do not rely on kademlia-specific assumptions and sometimes incorporate other features to compensate for some undesirable aspect of those metrics.
Since there can be ties in the metric (two candidates for each point), lookups could no longer converge on a precise set of closest nodes.
Bucket splitting and other routing table maintenance algorithms would need to be changed since they assume that identical distances can only occur with node identity.
I'm not sure whether it would affect Big-O properties or other guarantees of kademlia.
Anyway, this seems like an X-Y problem. You want to modify the metric to serve a particular goal. Maybe you should look for routing overlays designed with that goal in mind instead.
d(x,y) = abs(x-y) + 1/(x xor y)
This seems impractical, division on integers suffers from rounding. and in reality you would not be dealing with such small numbers but much larger (e.g. 160bit) numbers, making divisions more expensive too.

Related

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these functions are satisfactory solutions. I was thinking that I have the cosine theta score, I can calculate the percentage of overlap between two sentences two (e.g. the amount of overlapping words divided by the amount of words in the article) and maybe some more interesting things. Then with the data, I could maybe write a function (what type of function I do not know and is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# then 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
return 0.0
else:
return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, in theory let's pretend you have a model which out of 100 samples only predicts 1 answer (Giving 99 unknowns). Technically if that one answer is correct, that is a model with 100% accuracy, but which has a very low recall. Generally in machine learning you will find a trade off between recall and accuracy.
Some people like to go for certain metrics which combine the two (The most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about accuracy - I would only want to target consumers who are likely to buy my product. If however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be accurate only 10% of the time - if its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut off value which you believe indicates a good match. This is would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (Either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.

How to combine various distance functions into one given the following dataset?

I have a few distance functions which return distance between two images , I want to combine these distance into a single distance, using weighted scoring e.g. ax1+bx2+cx3+dx4 etc i want to learn these weights automatically such that my test error is minimised.
For this purpose i have a labeled dataset which has various triplets of images such that (a,b,c) , a has less distance to b than it has to c.
i.e. d(a,b)<d(a,c)
I want to learn such weights so that this ordering of triplets can be as accurate as possible.(i.e. the weighted linear score given is less for a&b and more for a&c).
What sort of machine learning algorithm can be used for the task,and how the desired task can be achieved?
Hopefully I understand your question correctly, but it seems that this could be solved more easily with constrained optimization directly, rather than classical machine learning (the algorithms of which are often implemented via constrained optimization, see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{d(I_p,I_r)-d(I_p,I_q),0} <= e_j for jth (p,q,r) in T s.t. d(I_p,I_r) <= d(I_p,I_q)
for the jth constraint, where I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e is measuring how far your chosen weights are from satisfying all the chosen triplets in the training set. If the weights preserve ordering of label j, then d(I_p,I_r)-d(I_p,I_q) < 0 and so e_j = 0. If they don't, then e_j will measure the amount of violation of training label j. Solving the optimization problem would give the best w; i.e. the one with the lowest error.
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.

How to find an eigenvector given eigenvalue 1, minimising memory use

I'd be grateful if people could help me find an efficient way (probably low memory algorithm) to tackle the following problem.
I need to find the stationary distribution x of a transition matrix P. The transition matrix is an extremely large, extremely sparse matrix, constructed such that all the columns sum to 1. Since the stationary distribution is given by the equation Px = x, then x is simply the eigenvector of P associated with eigenvalue 1.
I'm currently using GNU Octave to both generate the transition matrix, find the stationary distribution, and plot the results. I'm using the function eigs(), which calculates both eigenvalues and eigenvectors, and it is possible to return just one eigenvector, where the eigenvalue is 1 (I actually had to specify 1.1, to prevent an error). Construction of the transition matrix (using a sparse matrix) is fairly quick, but finding the eigenvector gets increasingly slow as I increase the size, and I'm running out of memory before I can examine even moderately sized problems.
My current code is
[v l] = eigs(P, 1, 1.01);
x = v / sum(v);
Given that I know that 1 is the eigenvalue, I'm wondering if there is either a better method to calculate the eigenvector, or a way that makes more efficient use of memory, given that I don't really need an intermediate large dense matrix. I naively tried
n = size(P,1); % number of states
Q = P - speye(n,n);
x = Q\zeros(n,1); % solve (P-I)x = 0
which fails, since Q is singular (by definition).
I would be very grateful if anyone has any ideas on how I should approach this, as it's a calculation I have to perform a great number of times, and I'd like to try it on larger and more complex models if possible.
As background to this problem, I'm solving for the equilibrium distribution of the number of infectives in a cattle herd in a stochastic SIR model. Unfortunately the transition matrix is very large for even moderately sized herds. For example: in an SIR model with an average of 20 individuals (95% of the time the population is between 12 and 28 individuals), P is 21169 by 21169 with 20340 non-zero values (i.e. 0.0005% dense), and uses up 321 Kb (a full matrix of that size would be 3.3 Gb), while for around 50 individuals P uses 3 Mb. x itself should be pretty small. I suspect that eigs() has a dense matrix somewhere, which is causing me to run out of memory, so I should be okay if I can avoid using full matrices.
Power iteration is a standard way to find the dominant eigenvalue of a matrix. You pick a random vector v, then hit it with P repeatedly until you stop seeing it change very much. You want to periodically divide v by sqrt(v^T v) to normalise it.
The rate of convergence here is proportional to the separation between the largest eigenvalue and the second largest eigenvalue. Each iteration takes just a couple of matrix multiplies.
There are fancier-pants ways to do this ("PageRank" is one good thing to search for here) that improve speed for really huge sparse matrices, but I don't know that they're necessary or useful here.
Your approach seems like a good one. However, what you're calling x, is the null space of Q. null(Q) would work if it supported sparse matrices, but it doesn't. There's a bunch of stuff on the web for finding the null space of a sparse matrix. For example:
http://www.mathworks.co.uk/matlabcentral/newsreader/view_thread/249467
http://www.mathworks.com/matlabcentral/fileexchange/42922-null-space-for-sparse-matrix/content/nulls.m
http://www.mathworks.com/matlabcentral/fileexchange/11120-null-space-of-a-sparse-matrix
It seems the best solution is to use the Power Iteration method, as suggested by tmyklebu.
The method is to iterate x = Px; x /= sum(x), until x converges. I'm assuming convergence if the d1 norm between successive iterations is less than 1e-5, as that seems to give good results.
Convergence can take a while, since the largest two eigenvalues are fairly close (the number of iterations needed to converge can vary considerably, from around 200 to 2000 depending on the model used and population sizes, but it gets there in the end). However, the memory requirements are low, and it's very easy to implement.

What are some relevant A.I. techniques for programming a flock of entities?

Specifically, I am talking about programming for this contest: http://www.nodewar.com/about
The contest involves you facing other teams with a swarm of spaceships in a two dimensional world. The ships are constrained by a boundary (if they exit they die), and they must continually avoid moons (who pull the ships in with their gravity). The goal is to kill the opposing queen.
I have attempted to program a few, relatively basic techniques, but I feel as though I am missing something fundamental.
For example, I implemented some of the boids behaviours (http://www.red3d.com/cwr/boids/), but they seemed to lack a... goal, so to speak.
Are there any common techniques (or, preferably, combinations of techniques) for this sort of game?
EDIT
I would just like to open this up again with a bounty since I feel like I'm still missing critical pieces of information. The following is my NodeWar code:
boundary_field = (o, position) ->
distance = (o.game.moon_field - o.lib.vec.len(o.lib.vec.diff(position, o.game.center)))
return distance
moon_field = (o, position) ->
return o.lib.vec.len(o.lib.vec.diff(position, o.moons[0].pos))
ai.step = (o) ->
torque = 0;
thrust = 0;
label = null;
fields = [boundary_field, moon_field]
# Total the potential fields and determine a target.
target = [0, 0]
score = -1000000
step = 1
square = 1
for x in [(-square + o.me.pos[0])..(square + o.me.pos[0])] by step
for y in [(-square + o.me.pos[1])..(square + o.me.pos[1])] by step
position = [x, y]
continue if o.lib.vec.len(position) > o.game.moon_field
value = (fields.map (f) -> f(o, position)).reduce (t, s) -> t + s
target = position if value > score
score = value if value > score
label = target
{ torque, thrust } = o.lib.targeting.simpleTarget(o.me, target)
return { torque, thrust, label }
I may have implemented potential fields incorrectly, however, since all the examples I could find are about discrete movements (whereas NodeWar is continuous and not exact).
The main problem being my A.I. never stays within the game area for more than 10 seconds without flying off-screen or crashing into a moon.
You can easily make the boids algorithm play the game of Nodewar, by just adding additional steering behaviours to your boids and/or modifying the default ones. For example, you would add a steering behaviour to avoid the moon, and a steering behaviour for enemy ships (repulsion or attraction, depending on the position between yours and the enemy ship). The weights of the attraction/repulsion forces should then be tweaked (possibly by genetic algorithms, playing different configurations against each other).
I believe this approach would give you already a relatively strong baseline player, to which you can start adding collaborative strategies etc.
You can achieve the best possible flocking behaviour using control theory. We can start by expressing the state of a ship (a vector of positions and velocities) as variable X. We will use x to represent a small deviation from some state X0. For each node, linearize around the state you want the individual ships to be at:
d/dt(X) = f(X - X0)
where X0 is the state you want the ship to be at. f() can be nonlinear (as in the case of your potential field). Close to this point, the will obey
d/dt(x) = Ax
A is a fixed matrix which you should tweak to achieve stability. Much of the field of control theory tells you how to do this. It's very important to you that A leads to a stable system and that it converges fast. You should now break out MATLAB or GNU Octave. You should also read up on what poles mean in control theory, and Laplace transforms. The poles of the system (eigenvalues of matrix A) will characterize the response of the ship to disturbances of any kind, its stability, and its convergeance speed. It will also tell you whether the ship will move towards X0 or oscillate around X0.
There are several ways of choosing A (You probably don't get full control over A because of the game's limited physics) without having to do much analysis yourself. Two such algorithms are optimal control (You define the cost of controlling the system vs the cost of deviating from X0), and pole placement (You choose where the poles will go).
The state space theory also applies to an entire flock of ships, deviating from the configuration desired by every ship at that point in time. If the system is nonlinear, You might not be able to guarantee that all configurations are stable.
Conclusion:
Use control theory to guarantee individual stability, calculating X0 based on the ships around you, and add a potential field to prevent ships from bumping into eachother and moons.
Disclaimer:
There are several ways of expressing a state space system. This is the simplest but you'll need to express it differently when considering the open loop system (split up A into another 'A' which gives you the physical system and some other matrices which give you the control parameters)

Recommended anomaly detection technique for simple, one-dimensional scenario?

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.
For example, with the following example data:
a = 10
b = 14
c = 25
d = 467
e = 12
d is clearly an anomaly, and I would want to perform a specific action based on this.
I was tempted to just try an use my knowledge of the particular domain to detect anomalies. For instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.
Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.
Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.
The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than through analysis, if this is not a bad thing to assume, please let me know. In terms of clustering, unless there's also standard algorithms to choose a k-value, I would find it hard to provide this value to a k-Means algorithm.
The action I want to take for an outlier/anomaly is to present it to the user, and recommend that the data point is basically removed from the data set (I won't get in to how they would do that, but it makes sense for my domain), thus it will not be used as input to another function.
So far I have tried three-sigma, and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, three-sigma points out instances which better fit with my intuition of the domain.
Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.
What is a recommended anomaly detection technique for simple, one-dimensional data?
Check out the three-sigma rule:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
An alternative method is the IQR outlier test:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN x is an extreme outlier
this test is usually employed by Box plots (indicated by the whiskers):
EDIT:
For your case (simple 1D univariate data), I think my first answer is well suited.
That however isn't applicable to multivariate data.
#smaclell suggested using K-means to find the outliers. Beside the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.
A better suited technique is the DBSCAN: a density-based clustering algorithm. Basically it grows regions with sufficiently high density into clusters which will be maximal set of density-connected points.
DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.
If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
If the number of neighbors is less than minPoints, the point is marked as noise.
If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.
Finally the set of all points marked as noise are considered outliers.
There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-Means. This would allow you to identify whether there are more than one related sets of data, such as a bimodal distribution. This does require you having some knowledge of how many clusters to expect but is fairly efficient and easy to implement.
After you have the means you could then try to find out if any point is far from any of the means. You can define 'far' however you want but I would recommend the suggestions by #Amro as a good starting point.
For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.
This is an old topic but still it lacks some information.
Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:
Detection of outliers with the mean and sigma has the obvious disadvantage of dependence of mean and sigma on the outliers themselves.
The case of the small sample limit (see question for example) is not adequately covered by, 3 sigma, K-Means, IQR etc.
And I could go on... However the statistical literature offers a simple metric: the median absolute deviation. (Medians are insensitive to outliers)
Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing
I think this problem can be solved in a few lines of python code like this:
import numpy as np
import scipy.stats as sts
x = np.array([10, 14, 25, 467, 12]) # your values
np.abs(x - np.median(x))/(sts.median_abs_deviation(x)/0.6745) #MAD criterion
Subsequently you reject values above a certain threshold (97.5 percentile of the distribution of data), in case of an assumed normal distribution the threshold is 2.24. Here it translates to:
array([ 0.6745 , 0. , 1.854875, 76.387125, 0.33725 ])
or the 467 entry being rejected.
Of course, one could argue, that the MAD (as presented) also assumes a normal dist. Therefore, why is it that argument 2 above (small sample) does not apply here? The answer is that MAD has a very high breakdown point. It is easy to choose different threshold points from different distributions and come to the same conclusion: 467 is the outlier.
Both three-sigma rule and IQR test are often used, and there are a couple of simple algorithms to detect anomalies.
The three-sigma rule is correct
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
The IQR test should be:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
If x > Q75 + 1.5 * IQR or x < Q25 - 1.5 * IQR THEN x is a mild outlier
If x > Q75 + 3.0 * IQR or x < Q25 – 3.0 * IQR THEN x is a extreme outlier

Resources