I have a fairly deep tree that consists of an initial "transaction" node (call that the 0th layer of the tree), from which there are 50 edges to the next nodes (call them the 1st layer of the tree), and then from each of those around 35 edges on average to the 2nd layer, and so on...
The initial node is a :txnEvent and all the rest are :mEvent
:mEvent nodes have 4 properties, one of which is called channel_name.
Now, I would like to retrieve all paths that go down to the 4th layer such that each path contains both a node with channel_name = 'A' and a node with channel_name = 'B'.
This query:
MATCH (n:txnEvent)-[r:TO*1..4]->(m:mEvent) RETURN COUNT(*);
tells me there are only 1,667,444 paths to consider.
However, the following query:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
EXTRACT (n in nodes(p) | n.channel_name),
EXTRACT (n in nodes(p) | n.step),
EXTRACT (n in nodes(p) | n.event_type),
EXTRACT (n in nodes(p) | n.event_device),
EXTRACT (r in relationships(p) | r.weight )
Takes almost 1 minute to execute (neo4j's UI on port 7474)
For completeness, neo4j is telling me:
"Started streaming 125517 records after 2 ms and completed after 50789 ms, displaying first 1000 rows."
So I'm wondering whether there's something obvious I'm missing. All of the properties that nodes have are indexed by the way. Is the query slow, or is it fast and the streaming is slow?
UPDATE:
This query, which doesn't stream data back:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
COUNT(*)
takes 35 s. Even though it's faster, presumably because no data is returned, I feel it's still quite slow.
UPDATE 2:
Ideally this data should go into a Jupyter notebook with a Python kernel.
Thanks for the PROFILE plan.
Keep in mind that the query you're asking for is a difficult one to process. Since you want paths where at least one node in the path has one property and at least one other node in the path has another property, there is no way to prune paths during expansion. Instead, every possible path has to be determined, and then every node in each of those 1.6 million paths has to be accessed to check for the property (and that has to be done twice for each path, for both properties). Thus the ~10 million db hits for the filter operation.
You could try expanding your heap and pagecache sizes (if you have the RAM to spare), but I don't see any easy ways to tune this query.
As for your question about the query time vs streaming, the problem is the query itself. The message you saw means that the first result was found extremely quickly, so it was ready in the stream almost immediately. Results are added to the stream as they're found, but the volume of paths needing to be matched and filtered, with no ability to prune paths during expansion, means it took a very long time for the query to complete.
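Since you mentioned wanting the data in a Jupyter notebook with a Python kernel, here is a minimal sketch of running the query and materializing the results with the official Neo4j Python driver; the Bolt URI and credentials are placeholders, and the list comprehensions stand in for the older EXTRACT syntax:

from neo4j import GraphDatabase

# Placeholder URI and credentials -- adjust to your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name = 'A')
  AND ANY(k IN nodes(p) WHERE k.channel_name = 'B')
RETURN [x IN nodes(p) | x.channel_name] AS channels,
       [x IN nodes(p) | x.step] AS steps,
       [r IN relationships(p) | r.weight] AS weights
"""

with driver.session() as session:
    # The driver streams records; this loop materializes all ~125k rows.
    rows = [record.data() for record in session.run(QUERY)]

driver.close()

Note that this doesn't make the query itself any faster; it only moves the streaming out of the browser, which has its own rendering overhead.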
[Edit] I'm using Neo4j 4.2.1
I have this need for a Cypher query that brings back a complete tree given its root node. All nodes and relationships must be fetched and present only once in the returned sets. Here's what I have come up with:
MATCH p = (n)-[*..]->(m)
WHERE id(n) = 0
WITH relationships(p) AS r
WITH DISTINCT last(r) AS rel
WITH [node IN [startNode(rel), endNode(rel)] | node] AS tmp, rel
UNWIND tmp AS node
RETURN collect(DISTINCT node) AS nodes, collect(DISTINCT rel) AS relationships;
Running the query on our database to get about 820 nodes makes the thing crash for lack of memory (5 GB allowed). Hard to believe. So I'm wondering: is this query ill-conceived? Is there a technique I'm using that shouldn't be used for my purpose?
I strongly recommend that you come up with a node property that is guaranteed to be the same on all the nodes in a contiguous tree, if you don't have one already. I'll call that property same_prop. Here's what I do to run queries like the one you're running:
Index same_prop. If you have different node labels, then you need this index created for each different node label you expect to have in the tree.
CREATE INDEX samepropnode FOR (n:your_label) ON (n.same_prop)
is the kind of thing you need in Neo4j 4+. In Neo4j, indices are cheap, and can sometimes speed up queries quite a bit.
Collect all possible values of same_prop and store them in a text file (I use tab-separated values, as they're safer than comma-separated values).
Use the Python driver, or any language of your choice that has a Neo4j driver (I strongly recommend the Neo4j-provided drivers, not third-party ones), to write wrapper code that executes a Cypher query something like this:
MATCH (p:your_label)-->(c)
USING INDEX p:your_label(same_prop)
WHERE p.same_prop IN [ same_prop_list ]
RETURN DISTINCT
p.datapiece1 AS `first_parent_datapiece`,
p.datapiecen AS `nth_parent_datapiece`,
c.datapiece1 AS `first_child_datapiece`,
c.datapiecen AS `nth_child_datapiece`
It's not a good idea, in general, to return nodes and relationships unless you're debugging.
Then in your Python (for example) code, you're simply going to read in all your same_prop values from the file you got in Step 2, chunk them up into reasonably sized chunks, maybe 1,000 or 10,000 values each, and substitute them for [ same_prop_list ] in the Cypher query on the fly.
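As a concrete sketch of that wrapper (the connection details, file name, and chunk size are placeholders, and the query is the skeleton from above):

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (p:your_label)-->(c)
USING INDEX p:your_label(same_prop)
WHERE p.same_prop IN $same_prop_list
RETURN DISTINCT p.datapiece1 AS first_parent_datapiece,
                c.datapiece1 AS first_child_datapiece
"""

# Read the tab-separated same_prop values collected earlier, one per line.
with open("same_prop_values.tsv") as f:
    values = [line.rstrip("\n") for line in f]

CHUNK = 1000  # tune this; 1,000 to 10,000 is usually reasonable
rows = []
with driver.session() as session:
    for i in range(0, len(values), CHUNK):
        result = session.run(QUERY, same_prop_list=values[i:i + CHUNK])
        rows.extend(record.data() for record in result)

driver.close()

Passing the chunk as a parameter ($same_prop_list), rather than string-substituting it, also lets the server reuse the cached query plan across chunks.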
I have a simple query
MATCH (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE) RETURN m
and when executing the query "manually" (i.e. using the browser interface to follow edges) I only get a single node as a result, since there are no further connections. Checking this with the query
MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:TYPE)<-[:CONNECTION]-(o:TYPE) RETURN m,o
shows no results and
MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:TYPE) RETURN m
shows a single node, so I have made no mistake doing the query manually.
However, the issue is that the first query takes ages to finish, and I do not understand why.
Consequently: what is the reason such a trivial query takes so long even though the maximum result would be one node?
Bonus: how can I fix this issue?
As Tezra mentioned, the variable-length pattern match isn't in the same category as the other two queries you listed, because there are no restrictions on the nodes in between n and m; they can be of any type. Given that your query is taking a long time, you likely have a fairly dense graph of :CONNECTION relationships between nodes of different types.
If you want to make sure all nodes in your path are of the same label, you need to add that yourself:
MATCH path = (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE)
WHERE all(node in nodes(path) WHERE node:TYPE)
RETURN m
Alternately you can use APOC Procedures, which has a fairly efficient means of finding connected nodes (and restricting nodes in the path by label):
MATCH (n:TYPE {id:123})
CALL apoc.path.subgraphNodes(n, {labelFilter:'TYPE', relationshipFilter:'<CONNECTION'}) YIELD node
RETURN node
SKIP 1 // to avoid returning `n`
MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:TYPE)<-[:CONNECTION]-(o:TYPE) RETURN m,o
is not a fair test of
MATCH (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE) RETURN m
because it excludes the possibility of
MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:ANYTHING_ELSE)<-[:CONNECTION]-(o:TYPE) RETURN m,o
For your main query, you should be returning DISTINCT results:
MATCH (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE) RETURN DISTINCT m
This is for 2 main reasons.
Without DISTINCT, each node is returned once for every possible path to it.
Because of the previous point, that is a lot of extra work for no additional meaningful information.
If you use RETURN DISTINCT, it gives the cypher planner the choice to do a pruning search instead of an exhaustive search.
You can also limit the depth of the exhaustive search using ..# so that it doesn't kill your query if you run against a much older version of Neo4j where the Cypher planner hasn't learned the pruning search yet. Example:
MATCH (n:TYPE {id:123})<-[:CONNECTION*..10]-(m:TYPE) RETURN m
I'm writing a Cypher query to load data from my Neo4j DB; this is my data model.
So basically what I want is a query that returns a Journal with all of its properties and everything related to it. I've tried the simple query below, but it is not performant at all, and my EC2 instance where the DB is hosted quickly runs out of memory:
MATCH p=(j:Journal)-[*0..]-(n) RETURN p
I managed to write a query using UNIONS
MATCH p=(j:Journal)<-[:BELONGS_TO]-(at:ArticleType) RETURN p
UNION
MATCH p=(j:Journal)<-[:OWNS]-(jo:JournalOwner) RETURN p
UNION
MATCH p=(j:Journal)<-[:BELONGS_TO]-(s:Section) RETURN p
UNION
MATCH p=(j:Journal)-[:ACCEPTS]->(fc:FileCategory) RETURN p
UNION
MATCH p=(j:Journal)-[:CHARGED_BY]->(a:APC) RETURN p
UNION
MATCH p=(j:Journal)-[:ACCEPTS]->(sft:SupportedFileType) RETURN p
UNION
MATCH p=(j:Journal)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification) RETURN p
SKIP 0 LIMIT 100
The query works fine and its performance is not bad at all; the only problem I'm finding is with the limit. I've been googling around and have seen that post-processing of UNION queries is not yet supported.
The referenced GitHub issue is not yet resolved, so post-processing of UNION is not yet possible (GitHub link).
Logically, the first thing I tried when I came across this issue was to put the pagination on each individual query, but this had some weird behaviour that didn't make much sense to me.
So I tried to write the query without using UNIONS, I came up with this
MATCH (j:Journal)
WITH j LIMIT 10
MATCH pa=(j)<-[:BELONGS_TO]-(a:ArticleType)
MATCH po=(j)<-[:OWNS]-(o:JournalOwner)
MATCH ps=(j)<-[:BELONGS_TO]-(s:Section)
MATCH pf=(j)-[:ACCEPTS]->(f:FileCategory)
MATCH pc=(j)-[:CHARGED_BY]->(apc:APC)
MATCH pt=(j)-[:ACCEPTS]->(sft:SupportedFileType)
MATCH pl=(j)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification)
RETURN pa, po, ps, pf, pc, pt, pl
This query, however, breaks my DB. I feel like I'm missing something essential for writing Cypher queries...
I've also looked into COLLECT and UNWIND in this Neo4j blog post, but couldn't really make sense of it.
How can I paginate my query without removing the unions? Or is there any other way of writing the query so that pagination can be applied at the Journal level and the performance isn't affected?
--- EDIT ---
Here is the execution plan for my second query
You really don't need UNION for this, because when you approach it with UNION, you're getting all the related nodes for every :Journal node, and only AFTER you've made all those expansions from every :Journal node do you limit your result set. That is a ton of work that will just be thrown away by your LIMIT.
Your second query looks like the more correct approach, matching on :Journal nodes with a LIMIT, and only then matching on the related nodes to prepare the data for return.
You said that the second query breaks your DB. Can you run a PROFILE on the query (or an EXPLAIN, if the query never finishes execution), expand all elements of the plan, and add it to your description?
Also, if you leave out the final MATCH to :Classification, does the query behave correctly?
It would also help to know if you really need the paths returned, or if it's enough to just return the connected nodes.
EDIT
If you want each :Journal and all its connected data on a single row, you need to either be using COLLECT() after each match, or using pattern comprehension so the result is already in a collection.
This will also cut down on unnecessary queries. Your initial match (after the limit) generated 31k rows, so all subsequent matches executed 31k times. If you collect() or use pattern comprehension, you'll keep the cardinality down to your initial 10, and prevent redundant matches.
Something like this, if you only want collected paths returned:
MATCH (j:Journal)
WITH j LIMIT 10
WITH j,
[pa=(j)<-[:BELONGS_TO]-(a:ArticleType) | pa] as pa,
[po=(j)<-[:OWNS]-(o:JournalOwner) | po] as po,
[ps=(j)<-[:BELONGS_TO]-(s:Section) | ps] as ps,
[pf=(j)-[:ACCEPTS]->(f:FileCategory) | pf] as pf,
[pc=(j)-[:CHARGED_BY]->(apc:APC) | pc] as pc,
[pt=(j)-[:ACCEPTS]->(sft:SupportedFileType) | pt] as pt,
[pl=(j)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification) | pl] as pl
RETURN pa, po, ps, pf, pc, pt, pl
I am experimenting with Neo4j using a simple dataset of Locations. A location can have a relation to another location:
(a:Location)-[rel]-(b:Location)
I already have the locations in the database (roughly 700,000+ Location entries).
Now I wanted to add the relation data (170M edges), but I wanted to experiment with the import logic on a smaller set first, so I basically picked 2 nodes that are in the set and tried to create a relationship as follows.
MERGE p = (a:Location {locationid: 3616})-[w:WikiLink]->(b:Location {locationid: 467501})
RETURN p;
and I also tried the approach straight from the documentation:
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b)
RETURN r
I tried using a directional merge, an undirectional merge, etc.; I basically tried multiple variants of the above queries, and the result is: they run forever, seemingly not completing even after 15 minutes, which is very odd.
Indexes
ON :Location(locationid) ONLINE (for uniqueness constraint)
Constraints
ON (location:Location) ASSERT location.locationid IS UNIQUE
This is what I am currently using:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///edgelist.csv' AS line WITH line
MATCH (a:Location {locationid: toInt(line.locationidone)}), (b:Location {locationid: toInt(line.locationidtwo)})
MERGE (a)-[w:WikiLink {weight: toFloat(line.edgeweight)}]-(b)
RETURN COUNT(w);
If you look at the terminal output below, you can see Neo4j reports a 258 ms query execution time; the real time, however, is somewhat above that. This query already takes a few seconds too long in my opinion (the machine this runs on has 48 GB RAM and 16 cores, and is relatively new).
I am currently running this query with LIMIT 1000 (before it was LIMIT 1), but the script has already been running for a few minutes. I wonder if I have to switch from MERGE to CREATE. The problem is, I cannot understand the call graph that EXPLAIN gives me in order to determine the bottleneck.
time /usr/local/neo4j/bin/neo4j-shell -file import-relations.cql
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| p |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [Node[758609]{title:"Tehran",locationid:3616,locationlabel:"NIL"},:WikiLink[9422418]{weight:1.2282325516616477E-7},Node[917147]{title:"Khorugh",locationid:467501,locationlabel:"city"}] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
Relationships created: 1
Properties set: 1
258 ms
real 0m1.417s
user 0m1.497s
sys 0m0.158s
If you haven't:
CREATE CONSTRAINT ON (loc:Location) ASSERT loc.locationid IS UNIQUE;
Then find both nodes, and create the relationship.
MATCH (a:Location {locationid: 3616}),(b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
or if the locations don't exist yet:
MERGE (a:Location {locationid: 3616})
MERGE (b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
You should also use parameters if you do that from a program.
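For instance, here is a minimal sketch with the official Python driver (the URI and credentials are placeholders; very old server versions used {param} instead of the $param syntax shown here):

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (a:Location {locationid: $id_a}), (b:Location {locationid: $id_b})
MERGE (a)-[w:WikiLink]->(b)
RETURN w
"""

with driver.session() as session:
    session.run(QUERY, id_a=3616, id_b=467501)

driver.close()

With parameters, the plan is compiled once and reused for every pair of IDs.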
Have you indexed the Location nodes on locationid?
CREATE INDEX ON :Location(locationid)
I had a similar problem adding edges to a graph, and indexing the nodes made the linking run over 150x faster.
If the nodes aren't indexed, Neo4j will do a serial scan to find the two nodes to link together.
USING PERIODIC COMMIT <value>:
Specifies the number of records (rows) to be committed in a transaction. Since you have plenty of RAM, it is good to use a value greater than 100000. This will reduce the number of transactions committed and might further reduce the overall time.
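A different technique with the same effect is to batch the rows yourself from the client and send them through UNWIND, instead of LOAD CSV with USING PERIODIC COMMIT. A sketch with the official Python driver, where the connection details and batch size are placeholders (toInteger/toFloat are the modern spellings of toInt/toFloat):

import csv

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
UNWIND $rows AS row
MATCH (a:Location {locationid: toInteger(row.locationidone)}),
      (b:Location {locationid: toInteger(row.locationidtwo)})
MERGE (a)-[w:WikiLink]->(b)
SET w.weight = toFloat(row.edgeweight)
"""

BATCH = 10000
with open("edgelist.csv") as f, driver.session() as session:
    batch = []
    for row in csv.DictReader(f):
        batch.append(dict(row))
        if len(batch) == BATCH:
            session.run(QUERY, rows=batch)
            batch = []
    if batch:  # flush the remainder
        session.run(QUERY, rows=batch)

driver.close()

Each session.run here is one transaction of BATCH rows, which plays the same role as the periodic commit size. Setting the weight with SET, rather than inside the MERGE pattern, also avoids creating duplicate relationships when the weight differs.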
I am creating a simple graph DB for transportation between a few cities. My structure is:
Station = physical station
Stop = each station has several stops, depending on time and line ID
Ride = connection between stops
I need to find a route from city A to city C, but there is no direct stop connection; they are connected through city B. (See the picture, please; as a new user I can't post images in the question.)
How can I get a route from City A with STOP1 connected by RIDE1 to STOP2, then STOP2 connected via the same City B to STOP3, and finally from STOP3 by RIDE2 to STOP4 (City C)?
Thank you.
UPDATE
The solution from Vince is OK, but I need to set a filter on the STOP nodes for departure time, something like:
MATCH p=shortestPath((a:City {name:'A'})-[*{departuretime>xxx}]-(c:City {name:'C'})) RETURN p
Is it possible to do this without iterating over all the matched paths? Because it's too slow.
If you are simply looking for a single route between two nodes, this Cypher query will return the shortest path between two City nodes, A and C.
MATCH p=shortestPath((a:City {name:'A'})-[*]-(c:City {name:'C'})) RETURN p
In general if you have a lot of potential paths in your graph, you should limit the search depth appropriately:
MATCH p=shortestPath((a:City {name:'A'})-[*..4]-(c:City {name:'C'})) RETURN p
If you want to return all possible paths you can omit the shortestPath clause:
MATCH p=(a:City {name:'A'})-[*]-(c:City {name:'C'}) RETURN p
The same caveats apply. See the Neo4j documentation for full details.
Update
After your subsequent comment.
I'm not sure what the exact purpose of the time property is here, but it seems as if you actually want the shortest weighted path between two nodes, based on some minimum time cost. This is different of course from shortestPath, because that minimises only the number of edges traversed, not the cost of those edges.
You'd normally model the traversal cost on edges, rather than nodes, but your graph has time only on the STOP nodes (and not for example on the RIDE edges, or the CITY nodes). To make a shortest weighted path query work here, we'd need to also model time as a property on all nodes and edges. If you make this change, and set the value to 0 for all nodes / edges where it isn't relevant then the following Cypher query does what I think you need.
MATCH p=(a:City {name: 'A'})-[*]-(c:City {name:'C'})
RETURN p AS shortestPath,
reduce(time=0, n in nodes(p) | time + n.time) AS m,
reduce(time=0, r in relationships(p) | time + r.time) as n
ORDER BY m + n ASC
LIMIT 1
In your example graph this produces a least cost path between A and C:
(A)->(STOP1)->(STOP2)->(B)->(STOP5)->(STOP6)->(C)
with a minimum time cost of 230.
This path includes two stops you have designated "bad", though I don't really understand why they're bad, because their traversal costs are less than other stops that are not "bad".
Or, use Dijkstra
This simple Cypher will probably not be performant on densely connected graphs. If you find that performance is a problem, you should use the REST API and the path endpoint of your source node, and request a shortest weighted path to the target node using Dijkstra's algorithm. Details are in the Neo4j REST API documentation.
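For illustration, a call to the legacy REST path endpoint might look roughly like this in Python; the node IDs, the cost property name, and the base URL are all placeholders, so check the REST API docs for your Neo4j version:

import requests

BASE = "http://localhost:7474/db/data"  # legacy REST API base URL

payload = {
    "to": BASE + "/node/456",  # URL of the target City node (placeholder ID)
    "cost_property": "time",   # property holding the traversal cost
    "algorithm": "dijkstra",
}

# 123 is the placeholder internal ID of the source City node.
response = requests.post(BASE + "/node/123/path", json=payload)
print(response.json())

An optional "relationships" entry in the payload can restrict which relationship types and directions the traversal follows.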
Ah ok, if the requirement is to find paths through the graph where the departure time at every stop is no earlier than the departure time of the previous stop, this should work:
MATCH p=(:City {name:'A'})-[*]-(:City {name:'C'})
MATCH (a:Stop) WHERE a IN nodes(p)
MATCH (b:Stop) WHERE b IN nodes(p)
WITH p, a, b ORDER BY b.time
WITH p AS ps, collect(DISTINCT a) AS stops, collect(DISTINCT b) AS sortedStops
WHERE stops = sortedStops
WITH ps, last(sortedStops).time - head(sortedStops).time AS elapsed
RETURN ps, elapsed ORDER BY elapsed ASC
This query works by matching every possible path, and then collecting all the stops on each matched path twice over. One of these collections of stops is ordered by departure time, while the other is not. Only if the two collections are equal (i.e. number and order) is the path admitted to the results. This step evicts invalid routes. Finally, the paths themselves are ordered by least elapsed time between the first and last stop, so the quickest route is first in the list.
Normal warnings about performance, etc. apply :)