Neo4j Cypher script freezes for too many nodes - neo4j

I have a database with about 300,000 nodes. For comparison with a previous version of the database, I need to get every node together with the number of nodes it is connected to (in either direction).
My cypher query looks like this:
match (node)-[r]-(n) return node.Name, count(n)
And my expected result looks like this:
Name | Count
Node1 | 8
Node2 | 3
Node3 | 5
I'm testing this on Neo4j's web interface (version 3.0.3). For some reason the web interface freezes, maybe because of the number of results coming back, so this seems to be a performance issue with the query.
Can this query still be optimized?

This is faster, has fewer db hits, and will also give you the names of unconnected nodes (if you want them).
MATCH (node) RETURN node.Name AS Name, size((node)-[]-()) AS Count
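Note for readers on later Neo4j versions: 5.x removed `size()` over pattern expressions, and the rough equivalent is a `COUNT` subquery. A sketch (same node and property names as the question):

```cypher
// Degree of each node, including unconnected nodes (degree 0).
// Neo4j 5.x syntax; on 3.x use size((node)-[]-()) as in the answer above.
MATCH (node)
RETURN node.Name AS Name, COUNT { (node)--() } AS Count
ORDER BY Count DESC
```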


why is neo4j so slow on this cypher query?

I have a fairly deep tree that consists of an initial "transaction" node (call that the 0th layer of the tree), from which there are 50 edges to the next nodes (call it the 1st layer of the tree), and then from each of those around 35 on average to the second layer, and so on...
The initial node is a :txnEvent and all the rest are :mEvent
mEvent nodes have 4 properties, one of them called channel_name
Now, I would like to retrieve all paths down to the 4th layer such that each path contains a node with channel_name = 'A' and also a node with channel_name = 'B'.
This query:
match (n: txnEvent)-[r:TO*1..4]->(m:mEvent) return COUNT(*);
Is telling me there are only 1,667,444 paths to consider.
However, the following query:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
EXTRACT(n IN nodes(p) | n.channel_name),
EXTRACT(n IN nodes(p) | n.step),
EXTRACT(n IN nodes(p) | n.event_type),
EXTRACT(n IN nodes(p) | n.event_device),
EXTRACT(r IN relationships(p) | r.weight)
Takes almost 1 minute to execute (neo4j's UI on port 7474)
For completeness, neo4j is telling me:
"Started streaming 125517 records after 2 ms and completed after 50789 ms, displaying first 1000 rows."
So I'm wondering whether there's something obvious I'm missing. All of the properties that nodes have are indexed by the way. Is the query slow, or is it fast and the streaming is slow?
UPDATE:
This query, that doesn't stream data back:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
COUNT(*)
Takes 35s, so even though it's faster (presumably because no data is returned), it still feels quite slow.
UPDATE 2:
Ideally this data should go into a jupyter notebook with a python kernel.
Thanks for the PROFILE plan.
Keep in mind that the query you're asking for is a difficult one to process. Since you want paths where at least one node in the path has one property and at least one other node in the path has another property, there is no way to prune paths during expansion. Instead, every possible path has to be determined, and then every node in each of those 1.6 million paths has to be accessed to check for the property (and that has to be done twice for each path, for both properties). Thus the ~10 million db hits for the filter operation.
You could try expanding your heap and pagecache sizes (if you have the RAM to spare), but I don't see any easy ways to tune this query.
As for your question about the query time vs streaming, the problem is the query itself. The message you saw means that the first result was found extremely quickly so the first result was ready in the stream almost immediately. Results are added to the stream as they're found, but the volume of paths needing to be matched and filtered with no ability to prune paths during expansion means it took a very long time for the query to complete.
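To illustrate the pruning point: if the predicate applied to a fixed position in the pattern (say, the final node), it could be pushed into the pattern itself, so non-matching expansions are abandoned early. A hypothetical sketch for contrast:

```cypher
// Prunable: the predicate sits on a fixed node in the pattern,
// so expansion can stop as soon as the endpoint doesn't match.
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent {channel_name: 'A'})
RETURN count(*)
// The original query cannot be rewritten this way: 'A' and 'B' may occur
// at ANY position in the path, so every path must be fully expanded and
// then filtered, which is where the ~10 million db hits come from.
```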

Confused about MERGE sometimes creating duplicate relationship [duplicate]

My database model has users and MAC addresses. A user can have multiple MAC addresses, but a MAC can only belong to one user. If some user sets his MAC and that MAC is already linked to another user, the existing relationship is removed and a new relationship is created between the new owner and that MAC. In other words, a MAC moves between users.
This is a particular instance of the Cypher query I'm using to assign MAC addresses:
MATCH (new:User { Id: 2 })
MERGE (mac:MacAddress { Value: "D857EFEF1CF6" })
WITH new, mac
OPTIONAL MATCH ()-[oldr:MAC_ADDRESS]->(mac)
DELETE oldr
MERGE (new)-[:MAC_ADDRESS]->(mac)
The query runs fine in my tests, but in production, for some strange reason it sometimes creates duplicate MacAddress nodes (and a new relationship between the user and each of those nodes). That is, a particular user can have multiple MacAddress nodes with the same Value.
I can tell they are different nodes because they have different node ID's. I'm also sure the Values are exactly the same because I can do a collect(distinct mac.Value) on them and the result is a collection with one element. The query above is the only one in the code that creates MacAddress nodes.
I'm using Neo4j 2.1.2. What's going on here?
Thanks,
Jan
Are you sure this is the entirety of the queries you're running? MERGE has this really common pitfall where it merges everything that you give it. So here's what people expect:
neo4j-sh (?)$ MERGE (mac:MacAddress { Value: "D857EFEF1CF6" });
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1
Properties set: 1
Labels added: 1
1650 ms
neo4j-sh (?)$ MERGE (mac:MacAddress { Value: "D857EFEF1CF6" });
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
17 ms
neo4j-sh (?)$ match (mac:MacAddress { Value: "D857EFEF1CF6" }) return count(mac);
+------------+
| count(mac) |
+------------+
| 1 |
+------------+
1 row
200 ms
So far, so good. That's what we expect. Now watch this:
neo4j-sh (?)$ MERGE (mac:MacAddress { Value: "D857EFEF1CF6" })-[r:foo]->(b:SomeNode {label: "Foo!"});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 2
Relationships created: 1
Properties set: 2
Labels added: 2
178 ms
neo4j-sh (?)$ match (mac:MacAddress { Value: "D857EFEF1CF6" }) return count(mac);
+------------+
| count(mac) |
+------------+
| 2 |
+------------+
1 row
2 ms
Wait, WTF happened here? We specified only the same MAC address again, why is a duplicate created?
The documentation on MERGE specifies that "MERGE will not partially use existing patterns — it’s all or nothing. If partial matches are needed, this can be accomplished by splitting a pattern up into multiple MERGE clauses". So because when we run this path MERGE the whole path doesn't already exist, it creates everything in it, including a duplicate mac address node.
There are frequently questions about duplicated nodes created by MERGE, and 99 times out of 100, this is what's going on.
This is the response I got back from Neo4j's support (emphasis mine):
I got some feedback from our team already, and it's currently known that this can happen in the absence of a constraint. MERGE is effectively MATCH or CREATE - and those two steps are run independently within the transaction. Given concurrent execution, and the "read committed" isolation level, there's a race condition between the two.
The team have done some discussion on how to provided a higher guarantee in the face of concurrency, and do have it noted as a feature request for consideration.
Meanwhile, they've assured me that using a constraint will provide the uniqueness you're looking for.
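Concretely, the constraint they're referring to can be created like this (Neo4j 2.x syntax, using the label and property from the question):

```cypher
// Unique constraint on MacAddress.Value (Neo4j 2.x syntax).
// With this in place, concurrent MERGEs on the same Value are serialized
// by the constraint's locking and can no longer race into duplicates.
CREATE CONSTRAINT ON (mac:MacAddress) ASSERT mac.Value IS UNIQUE
```

The constraint also creates a backing index, so the `MERGE (mac:MacAddress { Value: ... })` lookup gets faster as a side effect.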

Cypher performance in graph with large number of relationships from one node

I have a Neo4j graph (ver. 2.2.2) with a large number of relationships. For example: 1 "Group" node, 300,000 "Data" nodes, and 300,000 relationships from "Group" to every "Data" node. I need to check whether there is a relationship between a set of Data nodes (for example, 200 of them) and a specific Group node. But the Cypher query I used is very slow, and I tried many modifications of it with no result.
Cypher to create graph:
FOREACH (r IN range(1,300000) | CREATE (:Data {id:r}));
CREATE (:Group);
MATCH (g:Group),(d:Data) create (g)-[:READ]->(d);
Query 1: COST. 600003 total db hits in 730 ms.
Acceptable, but I asked for only 1 node.
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000] AND id(g)=300000 RETURN id(d);
Query 2: COST. 600003 total db hits in 25793 ms.
Not acceptable.
(Replace "..." with the real node ids from 10000 to 10199.)
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] AND id(g)=300000 RETURN id(d);
Query 3: COST. 1000 total db hits in 309 ms.
This is the only solution I found to make the query acceptable. I returned the ids of all "Group" nodes and manually filtered the result in my code, keeping only relationships to the node with id 300000.
(Replace "..." with the real node ids from 10000 to 10199.)
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] RETURN id(d), id(g);
Question 1: The total db hits in query 1 is surprising, but I accept that the physical model of Neo4j defines how this query is executed - it needs to look at every existing relationship from the "Group" node. I accept that. But why is there such a big difference in execution time between query 1 and query 2 if the number of db hits is the same (and the execution plan is the same)? I'm only returning the id of the node, not a large set of properties.
Question 2: Is query 3 the only way to optimize this query?
Apparently there is an issue with Cypher in 2.2.x with seekById.
You can prefix your query with PLANNER RULE in order to use the previous Cypher planner, but you'll have to split your pattern in two to make it really fast. Tested, e.g.:
PLANNER RULE
MATCH (d:Data) WHERE id(d) IN [30]
MATCH (g:Group) WHERE id(g) = 300992
MATCH (d)<-[:READ]-(g)
RETURN id(d)

Fixing inconsistent Cypher results in Neo4j

I've been getting some strange results with a Neo4j database I made (version 2.1.0-M01). I have a graph with the following relationship:
Node[211854]{name : "dst"} <-[:hasrel]- Node[211823]{name : "src"}
I've confirmed this relationship using the following query:
START m=node(211854) MATCH (m)<-[r]-(n) RETURN m,r,n;
which returns a one row result, as expected:
| m | r | n
| Node[211854] | :hasrel[225081] | Node[211823]
The following query returns nothing, however:
START n=node(211823) MATCH (m)<-[r]-(n) RETURN m,r,n
Any thoughts on what might be happening? I've run these queries with and without indexes on the name properties for the nodes.
EDIT: Fixed typo with initial node number.
EDIT2: I rebuilt the server and both queries return the results I expect. Perhaps the error was corruption in the first database?
Using node ids is not such a good idea; you can use the properties on your nodes to query them instead.
For example:
MATCH (m)<-[r]-(n {name: "src"}) RETURN m,r,n;
Does that query return what you expected?
You have to invert the relationship direction. Since you are looking for incoming relationships on node 211823, this relationship is not one of them; it's an outgoing relationship.
Please also update your database to the current version: 2.1.2 http://neo4j.org/download
START n=node(211823) MATCH (m)-[r]->(n) RETURN m,r,n
Perhaps you should give your nodes and relationships more descriptive names, so it's easier to spot when you've inverted a domain concept.

Retrieving relationships for a single node is slow

I'm getting started with Neo4j 2.0.1 and I'm already running into performance problems that make me think my approach is wrong. I have a single node type so far (all with the label NeoPerson) and one type of relationship (all with the type NeoWeight). In my test setup, there are about 100,000 nodes and each node has between 0 and 300 relationships to other nodes. There is a Neo4j 2.0-style index on NeoPerson's only field, profile_id (e.g. CREATE INDEX ON :NeoPerson(profile_id)). Looking up a NeoPerson by profile_id is reasonably fast:
neo4j-sh (?)$ match (n:NeoPerson {profile_id:38}) return n;
+----------------------------+
| n |
+----------------------------+
| Node[23840]{profile_id:38} |
+----------------------------+
1 row
45 ms
However, once I throw relationships into the mix, it gets quite slow.
neo4j-sh (?)$ match (n:NeoPerson {profile_id:38})-[e:NeoWeight]->() return n, e;
+----------------------------------------------------------------------------+
| n | e |
+----------------------------------------------------------------------------+
| Node[23840]{profile_id:38} | :NeoWeight[8178324]{value:384} |
| Node[23840]{profile_id:38} | :NeoWeight[8022460]{value:502} |
...
| Node[23840]{profile_id:38} | :NeoWeight[54914]{} |
+----------------------------------------------------------------------------+
244 rows
2409 ms
My understanding was that traversing relationships from a single node should be quite efficient (isn't that the point of using a graph database?), so why is it taking over 2 seconds for such a simple query on a small data set? I didn't see a way to add an index on a relationship whose keys are the source and/or destination nodes.
People use Neo4j in production without issues. If they have the requirement that the first user query has to return in a few ms, they warm up the caches after server start. E.g. by running their most important use-case queries upfront.
It takes some time to load the nodes and relationships from disk, especially if the relationships (and their properties) of that single node are spread across the relationship store file and are loaded from a spinning disk.
For the first query it also takes a bit longer as its query plan has to be built and compiled.
That's why in production you usually use parameters to allow query caching.
What is the use-case you're trying to address?
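As an example of the parameter suggestion: the profile lookup from the question, parameterized so the compiled plan is cached and reused across calls. (Parameter syntax varies by version: `{param}` in the 2.x era of the question, `$param` in 3.x and later.)

```cypher
// Parameterized lookup: the query text stays constant, so the plan
// is compiled once and reused for every profile_id value.
// Neo4j 2.x parameter syntax; newer versions use $profile_id instead.
MATCH (n:NeoPerson {profile_id: {profile_id}})-[e:NeoWeight]->()
RETURN n, e
```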
