List incoming edges in Neo4j from Java slow

I have a large graph (20M+ nodes) stored in a Neo4J database.
From a Java program, I need to retrieve the sets of incoming edges of several nodes and compute their intersection.
The indegree of some nodes in my graph is quite high (>300,000).
I tried the following:

Set<Node> set = new HashSet<>();
for ( Relationship r : source.getRelationships(Direction.INCOMING, DynamicRelationshipType.withName("Link")) )
{
    // getOtherNode on an INCOMING relationship returns the start node
    Node other = r.getOtherNode(source);
    set.add(other);
}

on a node with 390,241 incoming edges, and it took 10 seconds to complete.
However, if I store the node relations in a PostgreSQL database, I can retrieve all incoming relations of this node in less than 1 second.
The problem is that I have to repeat this operation for several (possibly high-indegree) nodes, which takes a lot of time. I thought that after warmup the Neo4j database would react faster, but that is not the case.
I don't really want to replicate my graph in PostgreSQL just for this operation, so I would appreciate any hints on speeding this up in Neo4j.
My computer has 32G of RAM, the Java heap space for my program is 20G.
All the other settings are as per default values.
Thanks in advance for your help!
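
For the intersection step, here is a minimal sketch of one way to cut the work (my own illustration, not from the thread; it assumes the Neo4j 3.x embedded API, where RelationshipType.withName replaces DynamicRelationshipType, and omits transaction handling). Processing the nodes in increasing indegree order lets retainAll shrink the working set as early as possible, and getDegree is cheap for dense nodes because their degree is stored in the node record:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

static final RelationshipType LINK = RelationshipType.withName("Link");

// Distinct sources of incoming Link edges for one node.
static Set<Node> incomingNeighbours(Node target) {
    Set<Node> result = new HashSet<>();
    for (Relationship r : target.getRelationships(Direction.INCOMING, LINK)) {
        result.add(r.getOtherNode(target));
    }
    return result;
}

// Intersect the incoming-neighbour sets of several nodes, cheapest first.
// Assumes targets is non-empty.
static Set<Node> intersectIncoming(List<Node> targets) {
    List<Node> byDegree = new ArrayList<>(targets);
    byDegree.sort(Comparator.comparingInt(n -> n.getDegree(LINK, Direction.INCOMING)));
    Set<Node> result = incomingNeighbours(byDegree.get(0));
    for (int i = 1; i < byDegree.size() && !result.isEmpty(); i++) {
        result.retainAll(incomingNeighbours(byDegree.get(i)));
    }
    return result;
}

Separately, note that a 20G heap on a 32G machine leaves relatively little RAM for Neo4j's page cache, which is what actually keeps relationship records in memory; it may be worth shrinking the heap and sizing dbms.memory.pagecache.size explicitly.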

Related

Graph Data Science library stream too slow & how to retrieve node label (type)?

Q1.
We're trying to perform a random walk. I followed this example:
https://github.com/neo4j/graph-data-science-client/blob/main/examples/import-sample-export-gnn.ipynb
Our graph consists of 170 million nodes and 1.7 billion edges, and we set enough memory for both the heap (256 GB) and the page cache (10 GB).
We set the sampling ratio to sample 10,000 nodes. The random walk takes a short time, but retrieving the graph into a Python dataframe via stream takes forever.
Is there something I'm missing here, e.g. indexing?
My understanding is that GDS basically:
1. projects the graph from disk to memory via gds.project,
2. performs the graph operation and holds the sampled graph via gds.rwr,
3. fetches the actual node/edge properties via gds.stream.
I don't think indexing is necessary in this circumstance, though I'm no database expert.
Q2.
We're trying to stream node labels from a catalog graph, but I can't find any function for that.
How should I fetch the node label (type)?
For Q1: after a long wait, we did retrieve the graph with the desired node count. I think the stream took so long because the sample has 40 million edges.
Is there some way to limit the edges as well?
I see that there used to be walkLength and walksPerNode parameters:
https://community.neo4j.com/t5/neo4j-graph-platform/how-is-the-number-of-random-walks-determined-in-gds/m-p/50250
Is there an equivalent for gds.alpha.graph.sample.rwr?
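
For reference, a minimal Cypher sketch of the sampling call as I read the GDS 2.x docs (the graph names are placeholders): samplingRatio and restartProbability are the only knobs I'm aware of, and the sample keeps the relationships induced between the sampled nodes, so I don't believe there is a direct edge cap. A higher restartProbability keeps walks closer to the start nodes, which tends to change how dense the resulting sample is.

CALL gds.alpha.graph.sample.rwr('mySample', 'myGraph', {
  samplingRatio: 0.0001,     // fraction of nodes to keep in the sample
  restartProbability: 0.1    // chance of jumping back to a start node at each step
})
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount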

How to explain improving speed with larger data in Neo4j Cypher?

I am experimenting with querying Neo4j data whose size gradually increases. Some of my query expressions behave as I would expect - a larger graph means a longer execution time. But, e.g., the query
MATCH (`g`:Graph)
MATCH (`g`)<-[:`Graph`]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:`Graph`]->(`g`)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons
has these execution times (in ms): 80.2, 208.1, 301.7, 399.23, 0.1, 2.07, 2.61, 2.81, 7.3, 1.5.
What explains the rapid acceleration from the fifth measurement onward? The data are the same, just extended; no indexes were created.
The data size in the 4th experiment:
201673 nodes,
601189 relationships,
859225 properties.
The data size on the 5th experiment:
242113 nodes,
741500 relationships,
1047060 properties.
All I can think of is that maybe Cypher starts using some indexes from a certain data size, but I can't find anywhere whether that's the case.
Thank you for any comments.
Neo4j cache management may explain your observations, but you should describe what you are doing more precisely. What version of Neo4j are you using? What is the Graph node? Are you repeatedly running the same query on a graph, then trying again with a larger or smaller graph?
If you are running the same query multiple times on the same data set and seeing more rapid execution times, then the cache may be the reason. In v3.5 and earlier the cache would "warm up" with repeated execution; putatively this does not occur in v4.x.
You might read up on cold-start behaviour and general performance-tuning tips. You might also look at your transaction log: is it accumulating large files?
Also, why the backticks around your node identifiers (`g`)? Just use (g:Graph) and [r:Graph], without the backticks.
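
One general way to tell cache effects apart from a plan change (my suggestion, not from the thread): run the query under PROFILE at each data size and compare the operator tree and db-hit counts. An identical plan with shrinking runtimes points to caching; a different operator tree points to the planner picking a new strategy at the larger size.

PROFILE
MATCH (g:Graph)
MATCH (g)<-[:Graph]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:Graph]->(g)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons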

Neo4j multiple node labels and performance

According to my Spring Data Neo4j 4 (SDN4) class hierarchy, I have a lot of Neo4j nodes with ~7 labels each.
Should I worry about the performance of my application with this number of labels per node, or do Neo4j labels (and their usage in SDN 4) not impact performance?
Behind every label is an index, so a high number of labels per node will increase the write time for any such node. If you're doing mass updates this will be noticeable, but for a regular application you will hardly notice the difference on writes. For reads it makes no difference.
Hope this helps,
Tom
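
For context, a minimal sketch of how such multi-label nodes arise (hypothetical entity names; Neo4j OGM, which SDN 4 builds on, applies one label per @NodeEntity class in the hierarchy):

import org.neo4j.ogm.annotation.NodeEntity;

// Saving a Car through an SDN 4 repository persists a single node with the
// labels :Car:Vehicle:BaseEntity, one label per class in the hierarchy.
@NodeEntity
public abstract class BaseEntity {
    private Long id;   // OGM picks up a Long field named "id" as the graph id
}

@NodeEntity
public abstract class Vehicle extends BaseEntity {
    private String make;
}

@NodeEntity
public class Car extends Vehicle {
    private int doors;
}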

Neo4j modelling advice

I am developing a realtime chat for people selling/buying items.
I would like to know the most performant way to store the messages of a room in Neo4j. I can see 2 options:
1) add a messages array property to the Room node;
2) make each message a node and link consecutive messages with a "NEXT" relationship.
Which option would be the most performant for Neo4j?
Would just appending a value to the messages array be easier for Neo4j to deal with?
From a performance point of view, the costs of the operations in Neo4j are as follows:
Find a node: O(1)
Traverse a relationship: O(1)
If you store every message in a single node, you only have to find one node, so the total cost of the operation is O(1) (constant).
But if you store every message in its own node with a NEXT relationship between consecutive messages, then to extract N messages you need to find N nodes, so the cost becomes N * 2 * O(1) = O(N) (linear; the factor of 2 is one find plus one traversal per message).
With this in mind, it seems that having all the messages in a single node is better. But of course the base cost of loading a node with a lot of data in it may be higher than loading a smaller node, so to be sure, I'd suggest measuring the time it takes to load a node holding all of the messages at different sizes and seeing how it scales. Then you can decide:
If it scales linearly => both options will have similar performance.
If it doesn't scale linearly:
less than linear => one node with all the messages will be better;
more than linear => a node per message will be better.
I suspect it will be less than linear, but assumptions aren't a good guide, so it's better to check.
If you are using Java 8 in your app, one way you can measure the operation time is using:
import java.time.Duration;
import java.time.Instant;

Instant start = Instant.now();
// ... operation under test ...
Instant end = Instant.now();
System.out.println(Duration.between(start, end).toMillis() + " ms");
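
For completeness, a minimal Cypher sketch of option 2 (the labels, property names, and the LAST_MESSAGE pointer are my own hypothetical choices, not from the thread): keeping a pointer from the room to the newest message makes fetching the latest N messages a bounded walk along the NEXT chain rather than a scan.

MATCH (room:Room {id: $roomId})-[:LAST_MESSAGE]->(last:Message)
MATCH (m:Message)-[:NEXT*0..24]->(last)    // last plus the 24 messages before it
RETURN m.text AS text, m.sentAt AS sentAt
ORDER BY m.sentAt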

Neo takes 13 seconds to count 0 nodes

I've been experimenting with Neo4j for a while now. I've been inserting data from the medical terminology database SNOMED in order to experiment with a large dataset.
While experimenting, I have repeatedly inserted and then deleted around 450,000 nodes. I'm finding Neo's performance to be somewhat unsatisfactory. For example, I have just removed all nodes from the database. Running the query:
MATCH (n) RETURN count(n)
takes 13085 ms to return 0 nodes.
I'm struggling to understand why it might take this long to count 0 things. Does Neo retain some memory of nodes that are deleted? Is it in some way hobbled by the fact that I have inserted and removed large numbers of nodes in the past? Might it perform better if I delete the data directory instead of deleting all nodes with Cypher?
Or is there some twiddling with memory allocation and so forth that might help?
I'm running it on an old-ish laptop running Linux Mint.
It's partly due to Neo4j's store format. Creating new nodes or relationships assigns them ids, where ids are actual offsets into the store files. Deleting a node or relationship merely marks that record as not in use in the store file; the file itself does not shrink. Counting all nodes is done by scanning the whole node store file to spot records that are in use.
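
To make that concrete, a minimal sketch of what the count amounts to (my illustration, assuming the embedded 2.x-era API, before a counts store existed, and an already-open GraphDatabaseService named db):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.tooling.GlobalGraphOperations;

// Walking every record slot ever allocated in the node store and counting
// the ones still in use: a store that once held ~450,000 nodes is read end
// to end even when the answer is 0.
try (Transaction tx = db.beginTx()) {   // db: an open GraphDatabaseService
    long inUse = 0;
    for (Node node : GlobalGraphOperations.at(db).getAllNodes()) {
        inUse++;
    }
    System.out.println(inUse + " nodes in use");
    tx.success();
}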
