I'm writing a JavaScript library for calculating graph measurements such as degree centrality, eccentricity, closeness, and betweenness.
To validate my library, I used two existing applications, Gephi and NodeXL, to run the same calculations.
The problem is that I get what look like different results.
I built a simple graph:
(A) ----- (B)
 |         |
 |         |
(C) ----- (D)
Gephi gave those results:
A ecc=2 close=1.333 bet=0.5
B ecc=2 close=1.333 bet=0.5
C ecc=2 close=1.333 bet=0.5
D ecc=2 close=1.333 bet=0.5
NodeXL gave those results:
A close=0.25 bet=0.5
B close=0.25 bet=0.5
C close=0.25 bet=0.5
D close=0.25 bet=0.5
Note that NodeXL does not calculate eccentricity.
Which one is right?
Are the results really different?
I didn't normalize (or at least didn't intend to normalize) any results.
It seems that Gephi returns the average shortest-path distance from a node to all other nodes in the network (as also stated in its documentation):
for A this gives: (1 + 1 + 2) / 3 = 1.333
while NodeXL gives the reciprocal of the sum of all shortest-path distances:
for A: 1 / (1 + 1 + 2) = 0.25
So I'd say the latter is correct, as it follows the definition of closeness centrality; igraph, for example, also uses the second version.
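If you want to convince yourself that the two tools agree on the underlying distances and only differ in the final formula, here is a quick check of both conventions on the square graph above (networkx is used only for the shortest paths; the numbers are the point):

import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
dist = dict(nx.all_pairs_shortest_path_length(G))

for v in G:
    total = sum(d for u, d in dist[v].items() if u != v)   # total distance to the other nodes
    print(v,
          "average distance (what Gephi reports):", round(total / (len(G) - 1), 3),   # 1.333
          "1 / total distance (what NodeXL reports):", round(1 / total, 3))           # 0.25

Both values come from the same distance sum (1 + 1 + 2 = 4 for every node here), so neither tool is wrong; they simply report different conventions.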
Actually, both measures are right. The one computed by NodeXL is the classic closeness centrality, the reciprocal of the total distance to all other nodes, so the higher the value, the closer the node is to the center. The one computed by Gephi is the average distance to all other nodes, essentially the inverse of (normalized) closeness, so there the lower the value, the more central the node.
The difference between the two lies in how they treat graph size. The average-distance version is already adjusted for the number of nodes, so the closeness of nodes from different networks can be compared; the reciprocal-of-total-distance version follows the textbook definition more directly, but its magnitude depends on the size of the graph.
References:
Sabidussi, G.: The centrality index of a graph. Psychometrika 31(4) (1966) 581–603.
Freeman, L. C.: Centrality in social networks: Conceptual clarification. Social Networks 1 (1978/79) 215–239.
Hope that clarifies the difference.
I'm building a similarity graph in Neo4j and gds.nodeSimilarity.stats is reporting a mean similarity score in the 0.60 to 0.85 range for the projection I'm using regardless of how I transform the graph. I've tried:
Only projecting relationships with edge weights greater than 1
Deleting the core node to increase the number of components (my graph is about a single topic, with the core node representing that topic)
Changing it to an undirected graph
I realize I can always set the similarityCutoff in gds.nodeSimilarity.write to a higher value, but I'm second-guessing myself since all the toy problems I used for training, including Neo4j's practices, had mean Jaccard scores less than 0.5. Am I overthinking this or is it a sign that something is wrong?
*** EDITED TO ADD ***
This is a graph that has two types of nodes: Posts and entities. The posts reflect various media types, while the entities reflect various authors and proper nouns. In this case, I'm mostly focused on Twitter. Some examples of relationships:
(e1:Entity {Type:'TwitterAccount'})-[:TWEETED]->(p:Post {Type:'Tweet'})-[:AT_MENTIONED]->(e2:Entity {Type:'TwitterAccount'})
(e1:Entity {Type:'TwitterAccount'})-[:TWEETED]->(p1:Post {Type:'Tweet'})-[:QUOTE_TWEETED]->(p2:Post {Type:'Tweet'})-[:AT_MENTIONED]->(e2:Entity {Type:'TwitterAccount'})
For my code, I've tried first projecting only AT_MENTIONED relationships:
CALL gds.graph.create('similarity_graph', ["Entity", "Post"], "AT_MENTIONED")
I've tried doing that with a reversed orientation:
CALL gds.graph.create('similarity_graph', ["Entity", "Post"], {AT_MENTIONED:{type:'AT_MENTIONED', orientation:'REVERSE'}})
I've tried creating a monopartite, weighted relationship between all the nodes with a RELATED_TO relationship ...
MATCH (e1:Entity)-[*2..3]->(e2:Entity)
WHERE e1.Type = 'TwitterAccount' AND e2.Type = 'TwitterAccount' AND id(e1) < id(e2)
WITH e1, e2, count(*) AS strength
MERGE (e1)-[r:RELATED_TO]->(e2)
SET r.strength = strength
...and then projecting that:
CALL gds.graph.create("similarity_graph", "Entity", "RELATED_TO")
Whichever one of the above I try, I then get my Jaccard distribution by running:
CALL gds.nodeSimilarity.stats('similarity_graph') YIELD nodesCompared, similarityDistribution
Part of why you are getting a high similarity score is that the default topK value is 10. This means that relationships are created (or considered) only between each node and its top 10 most similar neighbors. Try running the following query:
CALL gds.nodeSimilarity.stats('similarity_graph', {topK:1000})
YIELD nodesCompared, similarityDistribution
Now you will probably get a lower mean similarity distribution.
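To see the effect outside of Neo4j, here is a small self-contained sketch (plain Python, not GDS, with a made-up random projection) of why keeping only each node's top-k scores pushes the reported mean up:

import random

random.seed(0)

# Toy projection: 50 source nodes, each pointing at a random subset of 200 targets.
neighbors = {n: set(random.sample(range(200), random.randint(5, 30))) for n in range(50)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def mean_similarity(top_k):
    kept = []
    for n, na in neighbors.items():
        scores = sorted((jaccard(na, nb) for m, nb in neighbors.items() if m != n), reverse=True)
        kept.extend(scores[:top_k])          # keep only the top-k scores per node
    return sum(kept) / len(kept)

print("mean over top 10 per node:  ", round(mean_similarity(10), 3))
print("mean over top 1000 per node:", round(mean_similarity(1000), 3))   # lower, since weak pairs are included

Raising topK includes the weaker pairs in the distribution, which is why the mean drops.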
How dense the similarity graph should be depends on your use case. You can try the default values and see how it goes. If the result is still too dense you can raise the similarityCutoff threshold, and if it is too sparse you can raise the topK parameter. There is no silver bullet; it depends on your use case and dataset.
Changing the relationship direction will heavily influence the results. In a graph of
(:User)-[:RELATIONSHIP]->(:Item)
the resulting monopartite network will be a network of users. However, if you reverse the relationship
(:User)<-[:RELATIONSHIP]-(:Item)
then the resulting network will be a network of items.
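A rough illustration of that point with plain Python and toy data (not GDS itself): as I understand it, node similarity compares nodes by the targets of their outgoing relationships, so flipping the orientation flips which side of the bipartite graph gets compared:

# Toy bipartite edge list: (:User)-[:RELATIONSHIP]->(:Item)
edges = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3"), ("u3", "i1")]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def out_neighbors(edge_list):
    out = {}
    for src, dst in edge_list:
        out.setdefault(src, set()).add(dst)
    return out

users = out_neighbors(edges)                        # NATURAL orientation: compare users by shared items
items = out_neighbors([(d, s) for s, d in edges])   # REVERSE orientation: compare items by shared users

print("u1 ~ u2:", jaccard(users["u1"], users["u2"]))   # user-user similarity
print("i1 ~ i2:", jaccard(items["i1"], items["i2"]))   # item-item similarity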
Finally, having a mean Jaccard of 0.7 when you use topK 10 is actually great, as it means the relationships will be created between genuinely similar nodes. The Neo4j examples lower the similarity cutoff just so that some relationships are created and the similarity graph is not too sparse. You can also raise the topK parameter; it's hard to say exactly without more information about the size of your graph.
I am working on solving the following problem and implementing the solution in C++.
Let us assume that we have a directed, weighted graph G = (V, A, w) and a set of persons P.
We receive a number of queries; each query gives a person p and two vertices s and d, and asks for the minimum-weight path between s and d for person p. One person can have multiple paths.
After all queries have been processed, I am given a number k <= |A| and I should output k arcs such that the number of persons using at least one of those k arcs is maximal (this is a maximum coverage problem).
To solve the first part I implemented Dijkstra's algorithm using a std::priority_queue, and I compute the minimum weight between s and d. (Is this a good way to do it?)
To solve the second part I store, for every arc, the set of persons that use it, and I use a greedy algorithm to compute the set of arcs (at each stage I choose the arc used by the largest number of still-uncovered persons). (Is this a good way to do it?)
Finally, if my algorithms are good, how can I implement them efficiently in C++?
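For what it's worth, here is a rough sketch of the two steps as described above, written in Python for brevity (the same structure maps directly onto C++ with std::priority_queue and std::unordered_set; all names are made up):

import heapq

def dijkstra(graph, source):
    """Standard Dijkstra with a binary heap; graph maps a vertex to a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    parent = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                                  # stale heap entry, skip it
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                parent[v] = u                         # keep parents so the path (and its arcs) can be rebuilt
                heapq.heappush(heap, (d + w, v))
    return dist, parent

def greedy_coverage(arc_to_persons, k):
    """Pick k arcs, each time taking the arc that covers the most still-uncovered persons."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(arc_to_persons, key=lambda a: len(arc_to_persons[a] - covered), default=None)
        if best is None or not arc_to_persons[best] - covered:
            break                                     # nothing new can be covered
        chosen.append(best)
        covered |= arc_to_persons[best]
    return chosen, covered

Both choices are sound: Dijkstra with a binary heap runs in O((|V| + |A|) log |V|) per query, and since maximum coverage is NP-hard, the greedy rule is the standard heuristic and guarantees a (1 - 1/e) fraction of the optimal coverage.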
Suppose you have a graph G = (V, E). You can do whatever you want in terms of preprocessing on this graph G (within reasonable time and space constraints for a graph with a few thousand vertices, so you couldn't, for example, just store every possible answer).
Now suppose I select a subset V' of V. I want the MST over just these vertices V'. How do you do this quickly and efficiently?
There are two ways to solve the problem; their relative performance depends on the characteristics of the problem instance.
1) Applying an MST algorithm to the induced subgraph (solving from scratch).
2) Using dynamic algorithms to update the tree after changes to the problem.
There are two types of dynamic algorithms:
I) Edge insertion and deletion
G. Ramalingam and T. Reps, "On the computational complexity of dynamic graph problems," Theoret. Comput. Sci., vol. 158, no. 1, pp. 233–277, 1996.
II) Edge weight decreasing and increasing
D. Frigioni, A. Marchetti-Spaccamela, and U. Nanni, "Fully dynamic output bounded single source shortest path problem," in ACM-SIAM Symp. Discrete Algorithms, 1996, pp. 212–221.
D. Frigioni, A. Marchetti-Spaccamela, and U. Nanni, "Fully dynamic algorithms for maintaining shortest paths trees," J. Algorithms, vol. 34, pp. 251–281, 2000.
You can use these algorithms directly, or adapt them to the problem so that they also handle node insertion and deletion.
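For option 1) above, here is a minimal sketch (using networkx for brevity; the graph and the subset V' are made up, and this assumes "the MST over V'" means the MST of the subgraph induced by V'):

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([(1, 2, 4), (2, 3, 1), (1, 3, 2), (3, 4, 7), (2, 4, 3)])

V_prime = {1, 2, 3}                     # the selected vertex subset V'
H = G.subgraph(V_prime)                 # induced subgraph over V'
T = nx.minimum_spanning_tree(H)         # Kruskal by default, O(E log E)

print(sorted(T.edges(data="weight")))   # [(1, 3, 2), (2, 3, 1)]

If the induced subgraph is disconnected, this returns a minimum spanning forest rather than a single tree, so you may want to check connectivity first.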
In DBSCAN, a core point is defined as having more than MinPts points within Eps.
So if MinPts = 4, a point with 5 points in total within Eps is definitely a core point.
What about a point with 4 points (including itself) within Eps? Is it a core point, or a border point?
Border points are points that are (in DBSCAN) part of a cluster but are not dense themselves, i.e. every cluster member that is not a core point.
In the follow-up algorithm HDBSCAN, the concept of border points was discarded.
Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013).
Density-Based Clustering Based on Hierarchical Density Estimates.
Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013. Lecture Notes in Computer Science 7819. p. 160.
doi:10.1007/978-3-642-37456-2_14
which states:
Our new definitions are more consistent with a statistical interpretation of clusters as connected components of a level set of a density [...] border objects do not technically belong to the level set (their estimated density is below the threshold).
Actually, I just re-read the original paper and Definition 1 makes it look like the core point belongs to its own eps neighborhood. So if minPts is 4, then a point needs at least 3 others in its eps neighborhood.
Notice in Definition 1 that they say NEps(p) = {q ∈ D | dist(p,q) ≤ Eps}. If the point were excluded from its own Eps-neighborhood, then it would have said NEps(p) = {q ∈ D | dist(p,q) ≤ Eps and p != q}, where != means "not equal to".
This is also reinforced by the authors of DBSCAN in their OPTICS paper in Figure 4. http://fogo.dbs.ifi.lmu.de/Publikationen/Papers/OPTICS.pdf
So I think the scikit-learn interpretation is correct, and the illustration at http://en.wikipedia.org/wiki/DBSCAN is misleading.
This largely depends on the implementation. The best way is to just play with the implementation yourself.
In the original DBSCAN paper [1], the core point condition is given as |N_Eps(p)| >= MinPts, where N_Eps(p) is the Eps-neighborhood of a data point p, which is excluded from its own N_Eps.
Following your example, if MinPts = 4 and the point has 3 neighbors within Eps (or 4 points counting itself, as you say), then they don't form a cluster according to the original paper. The scikit-learn [2] implementation of DBSCAN, on the other hand, works the other way: it counts the point itself. So for MinPts = 4, four points in total are enough to form a cluster.
[1] Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise."
[2] http://scikit-learn.org
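A quick way to check the scikit-learn behavior described above (a toy example with made-up coordinates):

import numpy as np
from sklearn.cluster import DBSCAN

# Four points packed within eps of each other, plus one far-away point.
X = np.array([[0.0], [0.1], [0.2], [0.3], [10.0]])

db = DBSCAN(eps=0.5, min_samples=4).fit(X)
print(db.core_sample_indices_)   # [0 1 2 3]: each of the four close points is a core point
print(db.labels_)                # [ 0  0  0  0 -1]: one cluster plus one noise point

So with min_samples=4, a point with itself plus three neighbors inside eps is treated as a core point, i.e. scikit-learn counts the point itself.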
I've been reading papers on pairwise ranking and this is what I don't get:
what is the difference in the training/testing data between pointwise and pairwise ranking?
This is the paper that I have been reading:
http://www.cs.cornell.edu/people/tj/publications/joachims_02c.pdf
In there, it says that a data point in pairwise ranking is an inequality between two links:
[line] .=. [inequality between two links, which is the target] qid:[qid] [[feature of both link 1 and 2]:[value of 1 and 2]] # [info]
RankLib, however, does support pairwise rankers like RankNet and RankBoost, but the data point format it uses is the pointwise one:
[line] .=. [absolute ranking, which is the target] qid:[qid] [feature1]:[value1] [feature2]:[value2] ... # [info]
Is there something I am missing?
Pointwise ranking is analogous to regression. Each point has an associated rank score, and you want to predict that rank score. So your labeled data set consists of feature vectors and their associated rank scores for a given query,
e.g. {d1, r1} {d2, r2} {d3, r3} {d4, r4}
where r1 > r2 > r3 > r4.
Pairwise ranking is analogous to classification. Each data point is associated with another data point, and the goal is to learn a classifier which will predict which of the two is "more" relevant to a given query.
e.g. {d1 > d2} {d2 > d3} {d3 > d4}
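As far as I understand, this is also why RankLib can read a pointwise-style file and still train pairwise models such as RankNet: the preference pairs are generated internally, per query, from the absolute labels. A rough sketch of that conversion (plain Python; the rows are made up):

from itertools import combinations

# (query_id, feature_vector, relevance_label) rows, as in the pointwise format
rows = [
    ("q1", [0.9, 0.1], 3),
    ("q1", [0.4, 0.3], 2),
    ("q1", [0.2, 0.8], 1),
    ("q2", [0.5, 0.5], 2),
    ("q2", [0.1, 0.9], 0),
]

def pairwise_examples(rows):
    """Yield (preferred_features, other_features) pairs within each query."""
    by_query = {}
    for qid, x, y in rows:
        by_query.setdefault(qid, []).append((x, y))
    for docs in by_query.values():
        for (xa, ya), (xb, yb) in combinations(docs, 2):
            if ya == yb:
                continue                              # ties carry no preference
            yield (xa, xb) if ya > yb else (xb, xa)

for preferred, other in pairwise_examples(rows):
    print(preferred, ">", other)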