Cypher apoc.export.json.query is painfully slow - neo4j

I'm trying to export a subgraph (all nodes and relationships on some path) from Neo4j to JSON.
I'm running a Cypher export query with
WITH "{cypher_query}" AS query CALL apoc.export.json.query(query, "filename.jsonl", {}) YIELD file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data
RETURN file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data;
Where cypher_query is
MATCH p = (ancestor:Term {term_id: 'root_id'})<-[:IS_A*..]-(children:Term) WITH nodes(p) AS term, relationships(p) AS r, children AS x RETURN term, r, x
Ideally, I'd have the JSON be triples of (subject, relationship, object), i.e. (node1, relationship between the nodes, node2); my understanding is that in this case I'm getting more than two nodes per line because of the aggregation I use.
It takes more than two hours to export something like 80k nodes, and it would be great to speed this query up.
Would it benefit from being wrapped in apoc.periodic.iterate? I thought apoc.export.json.query was already optimized in this regard, but maybe I'm wrong.
Would it benefit from replacing the path-matching query, written in standard Cypher syntax, with some APOC function?
Is there a more efficient way of exporting a subgraph from a Neo4j database to JSON? I thought that maybe creating a graph object and exporting it would work, but I have no clue where the bottleneck is here and hence don't know how to proceed.

You could try this (although I do not see why you would need the relationships in the result, unless they have properties):
// restrict to full paths ending at leaf terms, to limit the number of paths
MATCH p = (root: Term {term_id: 'root_id'})<-[:IS_A*..]-(leaf: Term)
WHERE NOT EXISTS ((leaf)<-[:IS_A]-())
// extract all relationships
UNWIND relationships(p) AS rel
// Return what you need (probably a subset of what is indicated below, e.g. some properties)
RETURN startNode(rel) AS child,
rel,
endNode(rel) AS parent
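If you still need the result written to a file, you can feed a statement like that into the export procedure. A minimal sketch, assuming APOC is installed; the file name and the trimmed YIELD list are just examples:
WITH "MATCH p = (root:Term {term_id: 'root_id'})<-[:IS_A*..]-(leaf:Term)
      WHERE NOT EXISTS((leaf)<-[:IS_A]-())
      UNWIND relationships(p) AS rel
      RETURN startNode(rel) AS child, rel, endNode(rel) AS parent" AS query
CALL apoc.export.json.query(query, "triples.jsonl", {})
YIELD file, nodes, relationships, time, rows
RETURN file, nodes, relationships, time, rows;
Config options such as batchSize (or stream: true with a null file name, to return the data instead of writing it) can also be passed in the last map, which may help narrow down whether the path matching or the file writing is the bottleneck.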

Related

Neo4j query all triples fails

I have a graph with 300k nodes and 4M relationships.
I'd like to query all triples:
MATCH p=()-[]->()
RETURN p
I get the following error:
Neo.DatabaseError.Statement.ExecutionFailed
org.neo4j.io.pagecache.CursorException: PropertyRecord claims to have more property blocks than can fit in a record
Do you know what goes wrong? Thanks.
This is a way to export all nodes and relationships into a CSV file using an APOC function.
Ref: https://neo4j.com/labs/apoc/4.1/export/csv/
For example, to download all nodes and relationships of the Movies database:
CALL apoc.export.csv.all("movies.csv", {})
Or, if you want to use your own query, see the sample below:
MATCH (person:Person)
WHERE person.name STARTS WITH "L"
WITH collect(person) AS people
CALL apoc.export.csv.data(people, [], "movies-l.csv", {})
YIELD file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data
RETURN file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data
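If you prefer to hand the statement to the procedure directly instead of collecting the nodes first, apoc.export.csv.query takes a query string. A sketch along the same lines (the file name is just an example):
CALL apoc.export.csv.query(
  "MATCH (person:Person) WHERE person.name STARTS WITH 'L' RETURN person.name AS name",
  "movies-l-names.csv",
  {}
)
YIELD file, nodes, relationships, properties, time, rows
RETURN file, nodes, relationships, properties, time, rows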
==================
Why do you need to see 300k nodes and 4M relationships in one browser window?
You can use the alternatives below:
1. CALL db.schema.visualization() -> shows a simplified view of the database
2. MATCH p=()-[]->()
RETURN p
LIMIT 25 -> returns only a few paths to view

How to express constraints on consecutive relations of the Neo4j shortestPath query algorithm?

I would like to query a shortest path in Neo4j while expressing conditions between consecutive relations.
Suppose I have nodes labelled type and relations labelled rel. Such relations have the attributes starting_time, end_time, exec_time (for the moment they are of type string, but if you prefer you can consider them as integers). I would like to find the shortest path between two nodes b1 and b2 subject to the constraints that:
the relation outgoing from b1 should have the attribute starting_time greater than a given value (let me call it th);
if there is more than one such relation, the starting_time of the next relation should be greater than the end_time of the previous one.
Between two nodes I can have multiple relations.
I started from this query, which limits the relations to those with starting_time greater than th:
MATCH (b1:type {id:"0247"}), (b2:type {id:"0222"}),
p = shortestPath((b1)-[t:rel*]->(b2))
WHERE ALL (r IN relationships(p) WHERE r.starting_time > "14:56:00")
RETURN p;
I was trying something like this:
MATCH (b1:type {id:"0247"}), (b2:type {id:"0222"}),
p = shortestPath((b1)-[t:rel*]->(b2))
WITH "14:56:00" AS th
WHERE ALL (r IN relationships(p) WHERE r.starting_time > th WITH r.end_time AS th)
RETURN p;
but it does not work, and I am not sure the shortestPath algorithm in Neo4j accesses the relations of the shortest path sequentially.
How can I express such a condition in the Neo4j Cypher query language?
If it is not possible, is there a suitable way to model such a time condition in a graph database (I mean, how should I change the DB)?
This query may do what you want:
MATCH p = shortestPath((b1:type {id:"0247"})-[t:rel*]->(b2:type {id:"0222"}))
WHERE REDUCE(s = {ok: true, last: "14:56:00"}, r IN RELATIONSHIPS(p) |
{ok: s.ok AND r.starting_time > s.last, last: r.end_time}
).ok
RETURN p;
This query uses REDUCE to iteratively test the relationships while updating the current state s at every step.
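To see the accumulator pattern in isolation, here is a toy version of the same REDUCE run over a plain list of numbers instead of relationships:
// true only if the list is strictly increasing, starting from 0
RETURN REDUCE(s = {ok: true, last: 0}, x IN [1, 3, 2] |
  {ok: s.ok AND x > s.last, last: x}
).ok AS strictlyIncreasing
// returns false, because 2 is not greater than the previous value 3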

Neo4J match and create relationship is very very slow with few millions records

I have about 3.5M nodes with label A and about 400 nodes with label B.
Nodes with label B already have a directed relationship like (b1:B)-[:CONNECTS]->(b2:B). Now I need to add 3.5M relationships of another type by comparing A node properties with :CONNECTS relationship properties.
My statement looks like this:
MATCH (a:A)
MATCH (c:C)
MATCH (b1:B {id: a.a1_id})-[rl:CONNECTS*1..21]->(b2:B {id: a.b2_id}) WHERE ALL(x in rl WHERE x.connect_id = c.connect_id)
MATCH (new_a:B)-[r:TO]->(new_b:B) WHERE r in rl
CREATE (new_a)-[:TICKET {ticket_id: ID(a)}]->(new_b)
This statement is extremely slow and just hangs. I even tried some of the performance tuning mentioned here; in particular, I allocated 16GB of heap.
I find it quite strange that it can't handle this size of data. What am I missing? I tried to model the data differently, reduce the relationship queries, and use more schema indexes, but the type of data I have and the kind of query I want to run once all the data is there did not leave me much room.
I also tried to use periodic commit while creating the A nodes with a CSV import. It has the same issues.
I hope I am clear enough. I would really appreciate some input. Thanks.
What are the labels A, B, and C? A CONNECTS relationship is also devoid of meaning.
Queries like this are meant to be comprehensible, not the opposite!
// generates 3.5M rows
MATCH (a:A)
// generates x-times 3.5M rows
// you never use that C except for checking a connect_id?
MATCH (c:C)
// many million times execute this variable length expand
MATCH (b1:B {id: a.a1_id})-[rl:CONNECTS*1..21]->(b2:B {id: b2_id})
WHERE ALL(x in rl WHERE x.connect_id = c.connect_id)
// lookup by relationship is very bad, especially as you are looking over a cross product of all 400x400 B's
MATCH (new_a:B)-[r:TO]->(new_b:B) WHERE r in rl
// why do you store the id of a on this self-relationship?!
CREATE (new_b)-[:TICKET {ticket_id: ID(a)}]->(new_b);
Where does b2_id come from?
Perhaps something like this:
MATCH (a:A)
MATCH (b1:B {id: a.a1_id})
MATCH (b2:B {id: {b2_id}})
MATCH (b1)-[rels:CONNECTS*..21]->(b2)
WHERE ALL(x in tail(rels) WHERE x.connect_id = head(rels).connect_id)
UNWIND rels AS r
WITH a,startNode(r) as new_a, endNode(r) as new_b
CREATE (new_a)-[:TICKET {ticket_id: ID(a)}]->(new_b);
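For 3.5M source rows it may also help to batch the writes, for example with apoc.periodic.iterate if APOC is installed. A sketch of that, assuming (as in the original question) that b2_id lives on a; the batchSize is only a starting point:
CALL apoc.periodic.iterate(
  "MATCH (a:A) RETURN a",
  "MATCH (b1:B {id: a.a1_id})
   MATCH (b2:B {id: a.b2_id})
   MATCH (b1)-[rels:CONNECTS*..21]->(b2)
   WHERE ALL(x IN tail(rels) WHERE x.connect_id = head(rels).connect_id)
   UNWIND rels AS r
   WITH a, startNode(r) AS new_a, endNode(r) AS new_b
   CREATE (new_a)-[:TICKET {ticket_id: id(a)}]->(new_b)",
  {batchSize: 1000, parallel: false}
);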

neo4j cypher, performance on specifying labels when id(n) is in conditions

Has anyone tested, or does anyone know, whether (when querying a Neo4j database with Cypher) specifying the label, as in
MATCH (node:Label)
makes the selection faster even if
WHERE id(node) = x
is in place?
MATCH (n)
WHERE ID(n) = {x}
RETURN n
should be negligibly faster than
MATCH (n:MyLabel)
WHERE ID(n) = {x}
RETURN n
Both queries first get the node by internal id, but while the first query returns it directly, the second filters the result on hasLabel(n:MyLabel).
It's a good example of where it's possible to overuse labels. Similarly, if for the pattern (a:Person {name:"Étienne Gilson"})-[:FRIEND]->(b:Person)-[:FRIEND]->(c:Person) I know that only :Person nodes have -[:FRIEND]- relationships, there is no point in filtering b and c on that label. If a remote node should be retrieved from an index, then the label should be included to indicate that, i.e. -[:FRIEND]->(b:Person {name:"Jacques Maritain"}); but when no (indexed) property is included in that part of the pattern, the nodes will be reached by traversal, and if only people have friends, an additional filter on hasLabel(b:Person) would be pointless.
I understand this blog post to mean that filtering on a label is as cheap as a bitwise &, so the performance difference should be very small.
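If you want to verify this yourself, PROFILE both forms and compare the plans and db hits; both should start from a node-by-id seek, with the labelled version merely adding a label check on top (exact operator names vary between Neo4j versions):
PROFILE MATCH (n) WHERE id(n) = 42 RETURN n;
PROFILE MATCH (n:MyLabel) WHERE id(n) = 42 RETURN n;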

Cypher query to find all paths with same relationship type

I'm struggling to find a single clean, efficient Cypher query that will let me identify all distinct paths emanating from a start node such that every relationship in the path is of the same type when there are many relationship types.
Here's a simple version of the model:
CREATE (a), (b), (c), (d), (e), (f), (g),
(a)-[:X]->(b)-[:X]->(c)-[:X]->(d)-[:X]->(e),
(a)-[:Y]->(c)-[:Y]->(f)-[:Y]->(g)
In this model (a) has two outgoing relationship types, X and Y. I would like to retrieve all the paths that link nodes along relationship X as well as all the paths that link nodes along relationship Y.
I can do this programmatically outside of Cypher by making a series of queries: the first to retrieve the list of outgoing relationship types from the start node, and then a single query (submitted together as a batch) for each type. That looks like:
START n=node(1)
MATCH n-[r]->()
RETURN COLLECT(DISTINCT TYPE(r)) as rels;
followed by:
START n=node(1)
MATCH p = n-[:`reltype_param`*]->()
RETURN p as path;
The above satisfies my need, but requires at minimum 2 round trips to the server (again, assuming I batch together the second set of queries in one transaction).
A single-query approach that works, but is horribly inefficient is the following single Cypher query:
START n=node(1)
MATCH p = n-[r*]->() WHERE
ALL (x in RELATIONSHIPS(p) WHERE TYPE(x) = TYPE(HEAD(RELATIONSHIPS(p))))
RETURN p as path;
That query uses the ALL predicate to filter the relationships along the paths, enforcing that each relationship in the path matches the first relationship in the path. This, however, is really just a filter operation on what is essentially a combinatorial explosion of all possible paths, much less efficient than traversing a relationship of a known, given type first.
I feel like this should be possible with a single Cypher query, but I have not been able to get it right.
Here's a minor optimization; at least non-matching paths will fail fast:
MATCH (n)-[r]->()
WITH DISTINCT n, type(r) AS t
MATCH p = (n)-[rels*]->()
WHERE type(last(rels)) = t // last entry matches
RETURN p AS path
This is probably one of those things that should be in the Java API if you want it to be really performant, though.
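If APOC is available, another single-query option (not part of the answer above) is to expand once per relationship type with apoc.path.expandConfig, so that the traversal itself is restricted to one type instead of filtering afterwards. A sketch, using id(n) = 1 in place of the old START syntax:
MATCH (n)-[r]->()
WHERE id(n) = 1
WITH DISTINCT n, type(r) AS t
// expand only along outgoing relationships of this one type
CALL apoc.path.expandConfig(n, {relationshipFilter: t + ">", minLevel: 1}) YIELD path
RETURN path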
