While using the FastRP algorithm, a phrase in the documentation caught my attention, and I ran into the same situation myself.
Link: https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/
Phrase: Because of L2 normalization which is applied to each iteration (here only one iteration), all nodes have the same embedding despite having different age values (apart from rounding errors).
When computing embeddings with FastRP on a graph (let's consider only the properties, i.e. propertyRatio = 1), how can two nodes with different property values end up with exactly the same embedding? In the link I shared above, this is described as if it were normal behaviour, but it seemed a bit counterintuitive to me.
If there is a single node property value and propertyRatio of 1.0, then the embeddings are identical. However, as soon as you add more node properties or lower the propertyRatio, the values of node properties come into play.
One thing to note is that node values are normalized node by node, so if you use propertyRatio of 1 with the following nodes:
(a:Person {age: 10, numberOfPets: 1}), (b:Person {age: 100, numberOfPets: 10})
The embeddings will still be identical. However, for example, (c:Person {age: 10, numberOfPets: 10}) would have a different embedding.
As far as I understand, the node property values are normalized before being used in the FastRP algorithm, so as not to overpower the original FastRP embeddings (the network-position encoding).
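To make the per-node normalization concrete, here is the arithmetic under the assumption that each node's property vector is simply L2-normalized on its own: since (100, 10) is just 10 × (10, 1), both vectors normalize to the same unit vector.

\frac{(10,\ 1)}{\sqrt{10^2 + 1^2}} \approx (0.995,\ 0.0995), \qquad \frac{(100,\ 10)}{\sqrt{100^2 + 10^2}} \approx (0.995,\ 0.0995)

This is why (a) and (b) above receive identical embeddings, while (c) with the ratio 10 : 10 normalizes to roughly (0.707, 0.707) and therefore does not.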
Related
I have a large-scale network consisting of 62,578 nodes. It represents the topology of an SDN (Software-Defined Networking) network.
I want to partition the graph into a number of clusters; each cluster should be controlled by an SDN controller.
I tried the k-means algorithm, but it doesn't take the relationships between nodes into account; it relies only on the nodes and their properties.
Then I tried the similarity algorithm, but it calculates a similarity score between two nodes and creates a new relationship holding that value between them, so I couldn't make use of it for the k-means algorithm.
Louvain and Leiden don't allow specifying the number of clusters in advance. Is there any way to do that with these algorithms?
Any suggestions would be a great help to me. Many thanks.
Update: This photo is part of the whole graph.
Update 2:
// 1. Project the topology with the 'cost' relationship property
CALL gds.graph.project(
  'myGraph',
  'node',
  'to',
  { relationshipProperties: 'cost' }
)

// 2. Compute FastRP embeddings and write them back to the database
CALL gds.fastRP.write(
  'myGraph',
  {
    embeddingDimension: 1,
    writeProperty: 'fastrp-embedding'
  }
)
YIELD nodePropertiesWritten

// 3. Project a second graph that includes the written embedding property
CALL gds.graph.project(
  'myGraph_1',
  {
    node: { properties: 'fastrp-embedding' }
  },
  '*'
)

// 4. Cluster the nodes on their embeddings with k-means
CALL gds.alpha.kmeans.write(
  'myGraph_1',
  {
    nodeProperty: 'fastrp-embedding',
    k: 3,
    randomSeed: 42,
    writeProperty: 'kmeans'
  }
)
YIELD nodePropertiesWritten
Update 3:
I applied FastRP to create node embeddings for a graph consisting of 6,301 nodes. The embeddings are stored as node properties to be used for clustering with the k-means algorithm. I noticed that nodes that are close to each other are nevertheless assigned to different clusters.
Notes:
For FastRP, I set the embedding dimensions to 256.
For K-means, I set K to 2.
I tried smaller embedding dimensions (2, 4, etc.); the same results occurred.
In addition, I tried another graph with 8,846 nodes; similarly puzzling results occurred.
I didn't specify a random seed for FastRP, as I didn't know how to choose a suitable value for this parameter. Is it related to the graph size, like the embedding dimension?
For the sub-graph below, the following are the results of clustering:
If you want to define the number of clusters, you can try the following.
First create node embeddings using any of the available models, such as FastRP or node2vec. Next, use the k-means algorithm to cluster nodes based on the embeddings.
Use the mutate mode of FastRP and store the results under, say, an embedding property (a sketch of that step follows below). Next, run the k-means algorithm on the embeddings, as in the example:
CALL gds.alpha.kmeans.write('cities', {
  nodeProperty: 'embedding',
  k: 3,
  writeProperty: 'cluster'
})
where the k parameter defines the number of clusters and the nodeProperty should point to the embedding property.
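For completeness, the preceding FastRP mutate step could look roughly like this (a sketch only: the graph name 'cities', the embedding dimension of 128, and the property name 'embedding' are assumptions that should be adapted to your projection):

CALL gds.fastRP.mutate('cities', {
  embeddingDimension: 128,
  mutateProperty: 'embedding'
})
YIELD nodePropertiesWritten

The mutate mode keeps the embeddings in the in-memory graph, which is exactly where the k-means call above expects to find its nodeProperty.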
I'm building a similarity graph in Neo4j, and gds.nodeSimilarity.stats is reporting a mean similarity score in the 0.60 to 0.85 range for the projection I'm using, regardless of how I transform the graph. I've tried:
Only projecting relationships with edge weights greater than 1
Deleting the core node to increase the number of components (my graph is about a single topic, with the core node representing that topic)
Changing it to an undirected graph
I realize I can always set the similarityCutoff in gds.nodeSimilarity.write to a higher value, but I'm second-guessing myself, since all the toy problems I used for training, including Neo4j's practice examples, had mean Jaccard scores of less than 0.5. Am I overthinking this, or is it a sign that something is wrong?
*** EDITED TO ADD ***
This is a graph with two types of nodes: posts and entities. The posts reflect various media types, while the entities reflect various authors and proper nouns. In this case, I'm mostly focused on Twitter. Some examples of relationships:
(e1 {Type: 'TwitterAccount'})-[:TWEETED]->(p:Post {Type: 'Tweet'})-[:AT_MENTIONED]->(e2 {Type: 'TwitterAccount'})
(e1 {Type: 'TwitterAccount'})-[:TWEETED]->(p1:Post {Type: 'Tweet'})-[:QUOTE_TWEETED]->(p2:Post {Type: 'Tweet'})-[:AT_MENTIONED]->(e2 {Type: 'TwitterAccount'})
For my code, I've tried first projecting only AT_MENTIONED relationships:
CALL gds.graph.create('similarity_graph', ["Entity", "Post"], "AT_MENTIONED")
I've tried doing that with a reversed orientation:
CALL gds.graph.create('similarity_graph', ["Entity", "Post"], {AT_MENTIONED:{type:'AT_MENTIONED', orientation:'REVERSE'}})
I've tried creating a monopartite, weighted relationship between all the nodes with a RELATED_TO relationship ...
MATCH (e1:Entity)-[*2..3]->(e2:Entity)
WHERE e1.Type = 'TwitterAccount' AND e2.Type = 'TwitterAccount' AND id(e1) < id(e2)
WITH e1, e2, count(*) AS strength
MERGE (e1)-[r:RELATED_TO]->(e2)
SET r.strength = strength
...and then projecting that:
CALL gds.graph.create("similarity_graph", "Entity", "RELATED_TO")
Whichever one of the above I try, I then get my Jaccard distribution by running:
CALL gds.nodeSimilarity.stats('similarity_graph') YIELD nodesCompared, similarityDistribution
Part of why you are getting a high similarity score is that the default topK value is 10. This means that similarity relationships are created (or considered) only for the top 10 most similar neighbours of each node. Try running the following query:
CALL gds.nodeSimilarity.stats('similarity_graph', {topK:1000})
YIELD nodesCompared, similarityDistribution
Now you will probably get a lower mean similarity distribution.
How dense the similarity graph should be depends on your use case. You can try the default values and see how it goes. If the result is still too dense, you can raise the similarityCutoff threshold; if it is too sparse, you can raise the topK parameter. There is no silver bullet; it depends on your use case and dataset.
Changing the relationship direction will heavily influence the results. In a graph of
(:User)-[:RELATIONSHIP]->(:Item)
the resulting monopartite network will be a network of users. However, if you reverse the relationship
(:User)<-[:RELATIONSHIP]-(:Item)
then the resulting network will be a network of items.
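As an illustration (using the same gds.graph.create syntax as in the question; the graph name 'items_graph' and the User/Item labels are placeholders for this sketch), a reversed projection that makes nodeSimilarity compare items rather than users could look like:

CALL gds.graph.create('items_graph', ['User', 'Item'], {
  RELATIONSHIP: { type: 'RELATIONSHIP', orientation: 'REVERSE' }
})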
Finally, having a mean Jaccard similarity of 0.7 when you use topK 10 is actually great, as it means the relationships are created between genuinely similar nodes. The Neo4j examples lower the similarity cutoff just so that some relationships are created and the similarity graph is not too sparse. You can also raise the topK parameter; it's hard to say exactly without more information about the size of your graph.
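If you do decide to materialize the similarity relationships, a write call combining both knobs could look something like this (a sketch; the cutoff of 0.5 and the relationship/property names are only illustrative and should be tuned to your data):

CALL gds.nodeSimilarity.write('similarity_graph', {
  topK: 10,
  similarityCutoff: 0.5,
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
})
YIELD nodesCompared, relationshipsWritten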
I want to create a vocabulary graph with word vectors. The aim is to query for the nearest word in the vocabulary graph based on word similarity. How can I achieve this in Neo4j?
The following is an example:
Suppose vocabulary consists of the following:
Product Quality
Wrong Product
Product Price
Product Replacement
And the query word is: Affordable Product
In a single query, I should be able to figure out that "Affordable Product" is more closely related to "Product Price" than to any of the others.
Please note that I am storing word embeddings in the graph, so checking cosine similarity against each word in the vocabulary, one by one, would achieve this. However, when the vocabulary becomes large, querying one by one hurts speed and performance.
If there were a way to store the word embeddings for a domain vocabulary as a graph that can be queried for the nearest node based on cosine similarity, that would be a possible solution. However, I have not been able to find anything like this so far.
Looking forward to any pointers. Thanks.
What you want to do is store your embedding results in the graph. The next step is to use the Neo4j Graph Data Science library and run, specifically, the cosine similarity algorithm. It should look something along the lines of:
MATCH (p:Word)
WITH {item: id(p), weights: p.embedding} AS wordData
WITH collect(wordData) AS data
CALL gds.alpha.similarity.cosine.write({
  nodeProjection: '*',
  relationshipProjection: '*',
  data: data,
  // how many nearest neighbours should be stored per node
  topK: 1,
  // the minimal similarity a pair of nodes must have to still be considered relevant
  similarityCutoff: 0.1
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty, min, max, mean, p95
You have now preprocessed your nearest neighbors and can easily query them like:
MATCH (w:Word)-[:SIMILAR]-(other)
RETURN other
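If the word you are querying for is itself stored as a node, you can anchor the lookup on it, for example like this (assuming the word text is kept in a name property; adjust to whatever property your Word nodes actually use):

MATCH (w:Word {name: 'Affordable Product'})-[:SIMILAR]-(other)
RETURN other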
Hope this helps, let me know if you have any other questions.
After trying out and reading about several options, I found that https://github.com/facebookresearch/faiss is the best option for this use case.
Suppose you have a graph G = (V, E). You can do whatever you want in terms of preprocessing on this graph G (within reasonable time and space constraints for a graph with a few thousand vertices, so you couldn't just store every possible answer, for example).
Now suppose I select a subset V' of V. I want the MST over just these vertices V'. How do you do this quickly and efficiently?
There are two ways to solve the problem; their relative performance depends on the particular state of the problem.
Applying an MST algorithm to the subgraph (solving from scratch).
Using dynamic algorithms to update the tree after changes to the problem.
There are two types of dynamic algorithms:
I) Edge insertion and deletion
G. Ramalingam and T. Reps, "On the computational complexity of dynamic graph problems," Theoret. Comput. Sci., vol. 158, no. 1, pp. 233–277, 1996.
II) Edge weight decreasing and increasing
D. Frigioni, A. Marchetti-Spaccamela, and U. Nanni, "Fully dynamic output bounded single source shortest path problem," in ACM-SIAM Symp. Discrete Algorithms, 1996, pp. 212–221.
"Fully dynamic algorithms for maintaining shortest paths trees," J. Algorithms, vol. 34, pp. 251–281, 2000.
You can use them directly, or adapt them to the problem so that node insertion and deletion are also handled.
I am trying to find an appropriate function to obtain an accurate similarity between two persons according to their favourites.
For instance, persons are connected to tags, and their affinity for each tag is stored as a numeric value on the relationship to the tag node. I want to recommend similar persons to each person.
I have found two solutions:
Cosine Similarity
There is a cosine function in Neo4j, but it accepts only a single input, whereas for the formula above I need to pass vectors, such as:
for "a": a = [10, 20, 45], where each number indicates the person's affinity for a tag;
for "b": b = [20, 50, 70].
Pearson Correlation
While searching the web and your documentation, I found:
http://neo4j.com/docs/stable/cypher-cookbook-similarity-calc.html#cookbook-calculate-similarities-by-complex-calculations
My question is: what is the logic behind this formula?
What is the difference between r and H?
Because at first glance I think H1 or H2 always equal one, unless I should consider the rest of the graph.
Thanks in advance for any help.
I think the purpose of H1 and H2 is to normalize the results of the times property (the number of times the user ate the food) across food types. You can experiment with this example in this Neo4j console.
Since you mention other similarity measures you might be interested in this GraphGist, Similarity Measures For Collaborative Filtering With Cypher. It has some simple examples of calculating Pearson correlation and Jaccard similarity using Cypher.
This example makes it a little bit hard to understand what is going on, because H1 and H2 are both 1. A better example would show each person eating different numbers of food types, so you'd be able to see the value of H change. If "me" also ate "vegetables", "pizza", and "hotdogs", their H would be 4.
Can't help you with Neo4j, but I just want to point out that cosine similarity and Pearson's correlation coefficient are essentially the same thing. If you decode the different notations, you'll find that the only difference is that Pearson's zero-centers the vectors first. So you can define Pearson's as follows:
Pearson(a, b) = Cosine(a - mean(a), b - mean(b))
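Spelled out with the usual definitions (where \bar{a} and \bar{b} denote the means of the entries of a and b, subtracted from every component), the relationship is:

\mathrm{Cosine}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert},
\qquad
\mathrm{Pearson}(a, b) = \frac{(a - \bar{a}) \cdot (b - \bar{b})}{\lVert a - \bar{a} \rVert \, \lVert b - \bar{b} \rVert} = \mathrm{Cosine}(a - \bar{a},\ b - \bar{b})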