neo4j cypher query optimization - neo4j

With 500,000 nodes I'm getting 10-15 seconds, any idea how I can optimize this?
start n=node(*) WHERE HAS(n.score) RETURN n, n.score ORDER BY n.score DESC Limit 5;
From looking around, I get the sense that the WHERE clause is slowing it down, but I'm not sure how I can use a MATCH on a property of a node.

As Luanne says, it takes time because you are searching through all the nodes of your graph.
You could search only the nodes that have a score property (by indexing them, by reaching them from a common node, or, if you're using Neo4j 2, by labeling them).
See http://docs.neo4j.org/chunked/milestone/indexing.html for further explanations on indexes (which seem to be the most common solution).

With node(*) you're effectively touching your entire graph of 500,000 nodes to check for the presence of a property, and then ordering the results. How many rows do you get back?
If you drop your ORDER BY clause, is it any faster?
And what's your use case? I'm wondering if you can model this differently to avoid a global graph operation. For example, index the nodes with the score property, or create a relationship from all nodes with the score property to some sort of reference node. It really depends on your use case.
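To make the suggestion above concrete, here is a minimal Python sketch (not Cypher, and not how Neo4j is implemented) of why keeping track of the scored nodes helps: the top-5 lookup only visits nodes known to have a score instead of testing every node. The `nodes` dictionary, the `scored_ids` set and the `top_scores` helper are all made up for illustration.

```python
import heapq

# Hypothetical in-memory stand-in for the graph: node id -> properties.
nodes = {
    1: {"name": "a", "score": 10},
    2: {"name": "b"},
    3: {"name": "c", "score": 42},
    4: {"name": "d", "score": 7},
    5: {"name": "e"},
}

# Built once, playing the role of an index or label:
# the ids of the nodes that actually have a score property.
scored_ids = {nid for nid, props in nodes.items() if "score" in props}


def top_scores(k):
    """Return the k highest-scoring (id, score) pairs, visiting only indexed nodes."""
    candidates = ((nid, nodes[nid]["score"]) for nid in scored_ids)
    # nlargest keeps only k items in memory instead of sorting everything.
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


top = top_scores(2)
print(top)
```

The smaller the scored subset is relative to the whole graph, the more this kind of restriction should pay off.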

Related

neo4j CYPHER - Relationship Query doesn't finish

In a 14 GB database I have a few CITES relationships:
MATCH p=()-[r:CITES]->() RETURN count(r)
91
However, when I run
MATCH ()-[r:CITES]-() RETURN count(r)
it loads forever and eventually crashes with a browser window reload (Neo4j Desktop).
You can see the differences in how each of those queries will execute if you prefix each query with EXPLAIN.
The pattern used for the first query is such that the planner will find that count in the counts store, a transactionally updated store of counts of various things. This is a fast constant time lookup.
The other pattern, when omitting the direction, will not use the count store lookup and will actually have to traverse the graph (starting from every node in the graph), and that will take a long time as your graph grows.
As for what this gives back, it should actually be twice the number of :CITES relationships in your graph. Without a direction on the relationship, each individual relationship is found twice, because the same path with the start and end nodes switched also fits the given pattern.
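The doubling can be checked on a toy example. The Python sketch below is only a model of the pattern-matching semantics described above, assuming a made-up list of three CITES relationships with no self-loops; it is not how the planner executes the query.

```python
# Hypothetical toy graph: each tuple is one directed CITES relationship.
cites = [("p1", "p2"), ("p1", "p3"), ("p3", "p2")]


def count_directed(rels):
    """Matches of ()-[r:CITES]->(): one per relationship."""
    return sum(1 for _start, _end in rels)


def count_undirected(rels):
    """Matches of ()-[r:CITES]-(): each relationship matches from both ends."""
    all_nodes = {n for rel in rels for n in rel}
    matches = 0
    for node in all_nodes:          # every node is a possible left-hand side
        for start, end in rels:
            if node == start:       # followed in its stored direction
                matches += 1
            if node == end:         # followed against its stored direction
                matches += 1
    return matches


print(count_directed(cites), count_undirected(cites))
```

Note that the undirected count also has to loop over every node, which mirrors why that query gets slow on a large graph.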
Neo4j always chooses nodes as starting points for query execution. In your query, the query engine is probably touching the whole graph, since you are not adding any restrictions on node properties, labels, etc.
I think you should specify a label at least in your first node in the pattern.
MATCH (:Article)-[r:CITES]-() RETURN count(r)

Neo4j Cypher: Match and Delete the subgraph based on value of node property

Suppose I have 3 subgraphs in Neo4j and I would like to select and delete a whole subgraph if all the nodes in it match the filtering criterion, which is that each node's property value is <= 1. However, if there is at least one node within the subgraph that does not match the criterion, then the subgraph will not be deleted.
In this case the left subgraph will be deleted but the right subgraph and the middle one will stay. The right one will not be deleted even though it has some nodes with value 1 because there are also nodes with values greater than 1.
userids and values are the node properties.
I would be thankful if anyone could suggest a Cypher query that can be used to do that. Please note that the query will run on the whole graph, that is, on all three subgraphs, or more if there are any more.
Thanks for the clarification, that's a tricky requirement, and it's not immediately clear to me what the best approach is that will scale well with large graphs, as most possibilities seem to be expensive full graph operations. We'll likely need to use a few steps to set up the graph for easier querying later. I'm also assuming you mean "disconnected subgraphs", otherwise this answer won't work.
One start might be to label nodes as :Alive or :Dead based upon the property value. It should help if all nodes are of the same label, and if there's an index on the value property for that label, as our match operations could take advantage of the index instead of having to do a full label scan and property comparison.
MATCH (a:MyNode)
WHERE a.value <= 1
SET a:Dead
And separately
MATCH (a:MyNode)
WHERE a.value > 1
SET a:Alive
Then your query to mark nodes to delete would be:
MATCH (a:Dead)
WHERE NOT (a)-[*]-(:Alive)
SET a:ToDelete
And if all looks good with the nodes you've marked for delete, you can run your delete operation, using apoc.periodic.commit() from APOC Procedures to batch the operation if necessary.
MATCH (a:ToDelete)
DETACH DELETE a
If operations on disconnected subgraphs are going to be common, I highly encourage using a special node connected to each subgraph you create (such as a single :Cluster node at the head of the subgraph) so you can begin such operations on :Cluster nodes. This would greatly speed up these kinds of queries, since your query operations would be executed per cluster instead of per :Dead node.
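The rule being implemented (delete a disconnected subgraph only if every node in it has value <= 1) can be sketched outside the database as a connected-components pass. This is an illustrative Python model under the assumption that the graph fits in memory; `deletable_nodes`, `values` and `edges` are hypothetical names, not part of Neo4j or APOC.

```python
from collections import defaultdict


def deletable_nodes(values, edges, threshold=1):
    """Return ids of nodes in components where every value is <= threshold."""
    # Treat relationships as undirected when looking for components.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, doomed = set(), set()
    for start in values:
        if start in seen:
            continue
        # Collect the whole component containing `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        # One "alive" node is enough to keep the whole component.
        if all(values[n] <= threshold for n in component):
            doomed |= component
    return doomed


# Three disconnected subgraphs: only the first has all values <= 1.
values = {1: 0, 2: 1, 3: 1, 4: 1, 5: 3, 6: 2, 7: 5}
edges = [(1, 2), (2, 3), (4, 5), (6, 7)]
print(deletable_nodes(values, edges))
```

Each component is examined once here, which is the same benefit the :Cluster node idea aims for inside the database.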

Neo4j and Cypher - How can I create/merge chained sequential node relationships (and even better time-series)?

To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property and named it "outeseq" (original) and "ineseq" (second) to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that, it's just an endless wait. My JVM has 16g max (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last Q, if you have the time: I'd really like to use the raw data and just chain the directed relationships from one node's stamp to the next nearest node with the minimum delta, but I didn't run right at this for fear that it would cause a scan of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort the nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} each time, to cover all the data. It would be convenient to write a script for this in any language that has a Neo4j client.
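The arithmetic behind the "about 13 times" figure, and the shape of such a script, can be sketched as follows. This is a hedged Python outline only: `run_query` is a hypothetical callable standing in for whatever your Neo4j client uses to execute the paged query with a given offset, and no real driver API is shown.

```python
def batch_offsets(total_rows, batch_size):
    """Offsets to substitute for {offset}, one per run of the query."""
    return list(range(0, total_rows, batch_size))


def run_in_batches(run_query, total_rows, batch_size=30000):
    """Call run_query(offset, batch_size) once per slice; return the run count."""
    offsets = batch_offsets(total_rows, batch_size)
    for offset in offsets:
        run_query(offset, batch_size)
    return len(offsets)


# 370,366 nodes numbered 0..370365 give 370,365 consecutive (a, b) pairs.
offsets = batch_offsets(370365, 30000)
print(len(offsets), offsets[0], offsets[-1])

# Dry run with a stand-in for the real client call.
calls = []
runs = run_in_batches(lambda offset, size: calls.append(offset), 370365)
```

Running each slice in its own transaction keeps the amount of work per commit bounded, which is the point of paging in the first place.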
Updating Relationship's Properties
You can assign timestamp delta to relationships using SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
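As a rough model of what the linking and delta steps compute together, the Python sketch below pairs each row with its successor by direct lookup on the sequence number (rather than comparing every row with every other row) and records the absolute timestamp difference. The `forward_links` helper and the sample rows are invented for illustration, and timestamps are assumed to be plain integers.

```python
def forward_links(rows):
    """rows: iterable of (seq, timestamp) pairs.

    Returns (seq_a, seq_b, delta) triples for rows whose sequence
    numbers differ by exactly 1.
    """
    by_seq = dict(rows)  # seq -> timestamp, plays the role of the unique index
    links = []
    for seq, ts in sorted(by_seq.items()):
        nxt = seq + 1
        if nxt in by_seq:  # one lookup per row, no all-pairs comparison
            links.append((seq, nxt, abs(by_seq[nxt] - ts)))
    return links


rows = [(0, 100), (1, 130), (2, 125), (4, 200)]
links = forward_links(rows)
print(links)
```

Sequence number 4 gets no incoming link in this sample because there is no row with sequence number 3.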
Chaining Nodes With Minimal Delta
Once the relationships carry a delta property, the graph becomes a weighted graph, so we can compute the shortest path using the deltas as weights. Then we just save the length of the shortest path (the sum of the deltas) on the relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight=0, r in relationships(p) | weight+r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: the queries above are not guaranteed to work as written; they just hint at possible approaches to the problem.

Is a DFS Cypher Query possible?

My database contains about 300k nodes and 350k relationships.
My current query is:
start n=node(3) match p=(n)-[r:move*1..2]->(m) where all(r2 in relationships(p) where r2.GameID = STR(id(n))) return m;
The nodes touched in this query are all of the same kind, they are different positions in a game. Each of the relationships contains a property "GameID", which is used to identify the right relationship if you want to pass the graph via a path. So if you start traversing the graph at a node and follow the relationship with the right GameID, there won't be another path starting at the first node with a relationship that fits the GameID.
There are nodes that have hundreds of in and outgoing relationships, some others only have a few.
The problem is that I don't know how to tell Cypher how to do this. The above query works for a depth of 1 or 2, but it should look like [r:move*] to return the whole path, which is about 20-200 hops.
But if I raise the values, the queries won't finish. I think that Cypher looks at each outgoing relationship at every single path depth relative to the start node, but as I already explained, there is only one right path. So it should do some kind of DFS instead of a BFS. Is there a way to do so?
I would consider configuring a relationship index for the GameID property. See http://docs.neo4j.org/chunked/milestone/auto-indexing.html#auto-indexing-config.
Once you have done that, you can try a query like the following (I have not tested this):
START n=node(3), r=relationship:rels(GameID = 3)
MATCH (n)-[r*1..]->(m)
RETURN m;
Such a query would limit the relationships considered by the MATCH clause to just the ones with the GameID you care about. And getting that initial collection of relationships would be fast, because of the indexing.
As an aside: since Neo4j reuses its internally generated IDs (for nodes that are deleted), storing those IDs as GameIDs will make your data unreliable (unless you never delete any such nodes). You may want to generate your own unique IDs, store them in your nodes, and use them for your GameIDs. If you do this, you should also create a uniqueness constraint for your own IDs; as a nice side effect, this will automatically create an index for them.
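The traversal the question is after (at every node, follow only the one relationship carrying the right GameID) can be modelled in a few lines. This Python sketch assumes a made-up adjacency structure `out_rels` and a hypothetical `follow_game` helper; it shows that the work grows with the path length rather than with the number of relationships at each depth, and it is not a Neo4j API.

```python
def follow_game(out_rels, start, game_id, max_hops=1000):
    """Follow the single relationship tagged with game_id at each step.

    out_rels maps a node to a list of (game_id, target) pairs, one per
    outgoing 'move' relationship. max_hops guards against cycles.
    """
    path, node = [start], start
    for _ in range(max_hops):
        # Pick the one outgoing relationship that belongs to this game.
        nxt = next((t for g, t in out_rels.get(node, []) if g == game_id), None)
        if nxt is None:
            break
        path.append(nxt)
        node = nxt
    return path


out_rels = {
    3: [("3", 10), ("7", 11)],
    10: [("7", 12), ("3", 20)],
    20: [("9", 3)],
}
print(follow_game(out_rels, 3, "3"))
```

An index keyed on GameID would replace the linear scan over each node's outgoing relationships with a direct lookup.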

How to get count for all nodes/edges downstream of some node in Neo4J

I'm wondering, within Cypher if there is a way to get a count of all nodes downstream of some node x.
For my particular use-case I have a number of graphs, which are separate entities, but stored in the same instance. I would like to find out, for each graph, what the node and relationship count is.
I already have this for relationships
start r=rel(*) return count(*)
and this for nodes
start n=node(*) return count(*)
for everything in the database.
Many thanks,
Eamonn
If you have some "reference" or root node per subgraph you can use path expressions to find all nodes:
start root=node:roots(id="xx")
match root-[*..5]->end
return count(distinct end)
It makes sense to limit the depth of your search.
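What the depth-limited pattern counts can be illustrated with a small breadth-first sketch in Python. The `out_edges` mapping and the `count_downstream` helper are hypothetical; the code only models "distinct nodes reachable from the root within N outgoing hops" and does not reproduce how Cypher evaluates the path expression.

```python
def count_downstream(out_edges, root, max_depth=5):
    """Count distinct nodes reachable from root via 1..max_depth outgoing hops."""
    reached = set()
    frontier = {root}
    for _ in range(max_depth):
        next_frontier = set()
        for node in frontier:
            for target in out_edges.get(node, ()):
                if target not in reached:   # count each node only once
                    reached.add(target)
                    next_frontier.add(target)
        frontier = next_frontier
        if not frontier:                    # nothing new to expand
            break
    return len(reached)


out_edges = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"]}
print(count_downstream(out_edges, "root"))
```

Lowering `max_depth` shrinks the count, which is the trade-off behind limiting the search depth.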
You must index all the properties in your nodes/rels. Then you must start at these indexes to get the counts and, if necessary, sum them together for each graph.
Let's assume we have 2 graphs, a book-author type and a car-color type. Then, to get the overall sum of nodes for each graph in Cypher:
start g1=node:node_auto_index('bookName:*'), g11=node:node_auto_index('authorName:*'),
g2=node:node_auto_index('carName:*'), g22=node:node_auto_index('carColor:*')
return count(g1)+count(g11) as graph1, count(g2)+count(g22) as graph2
Similarly for all relationships. I don't know of any Cypher solution that could simply group by an undefined property; that would solve the problem easily.

Resources