I'm new to Neo4j and looking for help.
I have two entities, Entity1 and Entity2, with their relationship defined as CEO. I was able to load the data and form the relationships using MERGE.
confidence value | Entity1     | Entity2        | Relationship
0.884799964      | Jamie Dimon | JPMorgan Chase | CEO
0.884799964      | Jamie Dimon | JPMorgan Chase | CEO
0.380894504      | Jamie Dimon | JPMorgan Chase | CEO
My question: the confidence values for Jamie Dimon are 0.88 and 0.38, and I want to display a single relationship between Jamie Dimon and JPMorgan Chase that holds the maximum confidence value (0.88).
With the query below I was able to display two relationships (with confidence values 0.88 and 0.38) instead of three, but I want a single relationship holding the maximum confidence:
LOAD CSV WITH HEADERS FROM 'file:///result_is_of_neo4j_final1.csv' AS line
MERGE (e1:Entity1 {name: line.relation_first, e1_confidence: toFloat(line.entities_0_confidence)})
WITH line, e1
MERGE (e2:Entity2 {name : line.relation_second, e2_confidence: toFloat(line.entities_1_confidence)})
WITH e2, e1, line
MERGE (e1)-[r:IS_FROM {relation : line.relation_relation, r_confidence: toFloat(line.relation_confidence)}]->(e2)
RETURN e1,r,e2
How much data are you planning to load this way? If it's a large import, it's likely you'll want to use PERIODIC COMMIT to speed up the import and avoid memory issues. However, doing so will also impact any kind of comparison and conditional logic in your import, as it's not guaranteed that the rows you need to compare are being executed within the same transaction.
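For example, a minimal sketch of your import with periodic commits (the batch size is arbitrary, and the intermediate WITH clauses and the final RETURN are not needed for the load itself):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///result_is_of_neo4j_final1.csv' AS line
MERGE (e1:Entity1 {name: line.relation_first, e1_confidence: toFloat(line.entities_0_confidence)})
MERGE (e2:Entity2 {name: line.relation_second, e2_confidence: toFloat(line.entities_1_confidence)})
MERGE (e1)-[:IS_FROM {relation: line.relation_relation, r_confidence: toFloat(line.relation_confidence)}]->(e2)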
I'd recommend importing all of your nodes and relationships without any extra logic, and then running a query after all the data is loaded to remove the unnecessary relationships.
This query should work for you after the graph is loaded. For each group of :IS_FROM relationships of a given relation between the same pair of nodes, it keeps only the relationship with the highest r_confidence and deletes the others:
MATCH (e1:Entity1)-[r:IS_FROM]->(e2:Entity2)
WITH e1, r.relation as relation, COLLECT(r) as rels, MAX(r.r_confidence) as maxConfidence, e2
WHERE SIZE(rels) > 1
WITH FILTER(r in rels WHERE r.r_confidence <> maxConfidence) as toDelete
FOREACH (rel in toDelete | DELETE rel)
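Note that on newer Neo4j versions (4.0+), FILTER() has been removed in favor of list comprehensions, so that WITH line would instead be written as:

WITH [r IN rels WHERE r.r_confidence <> maxConfidence] as toDelete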
EDIT
If you need to get rid of duplicate relationships too, an alternate approach that should work better is to order the relationships of a given relation between two nodes by confidence, and delete all except the first:
MATCH (e1:Entity1)-[r:IS_FROM]->(e2:Entity2)
WITH e1, r.relation as relation, r, e2
ORDER BY r.r_confidence DESC
// since we just sorted, the collection will be in order
WITH e1, relation, COLLECT(r) as rels, e2
// delete all other relationships except the top
FOREACH (rel in TAIL(rels) | DELETE rel)
Background
I want to create a histogram of the relationships starting from a set of nodes.
Input is a set of node ids, for example set = [ id_0, id_1, id_2, id_3, ... id_n ].
The output is the relationship-type histogram for each node (e.g. Map<Long, Map<String, Long>>):
id_0:
- ACTED_IN: 14
- DIRECTED: 1
id_1:
- DIRECTED: 12
- WROTE: 5
- ACTED_IN: 2
id_2:
...
The current Cypher query I've written is:
MATCH (n)-[r]-()
WHERE id(n) IN [ id_0, id_1, id_2, id_3, ... id_n ] // set
RETURN id(n) as id, type(r) as type, count(r) as count
It returns one row per [ id, type ] pair with its count, like:
id  | rel type | count
id0 | ACTED_IN | 14
id0 | DIRECTED | 1
id1 | DIRECTED | 12
id1 | WROTE    | 5
id1 | ACTED_IN | 2
...
The result is collected in Java and merged into the structure above (e.g. Map<Long, Map<String, Long>>).
Problem
Getting the relationship histogram on smaller graphs is fast, but it can be very slow on bigger datasets. For example, for a set of about 100 ids/nodes where each node has around 1000 relationships, the Cypher query took about 5 minutes to execute.
Is there a more efficient way to collect the histogram for a set of nodes?
Could this query be parallelized? (With Java code or using UNION?)
Is something wrong with how I set up my Neo4j database? Should these queries be this slow?
There is no need for parallel queries, just the need to understand Cypher efficiency and how to use statistics.
A bit of background:
Using count will execute an expandAll, which is as expensive as the number of relationships the node has:
PROFILE
MATCH (n) WHERE id(n) = 21
MATCH (n)-[r]-(x)
RETURN n, type(r), count(*)
Using size with a relationship type internally uses getDegree, which is a statistic stored locally on the node, and is thus very efficient:
PROFILE
MATCH (n) WHERE id(n) = 0
RETURN n, size((n)-[:SEARCH_RESULT]-())
Moral of the story: to use size, you need to know the relationship types a labeled node can have. So you need to know the schema of the database (in general you will want that anyway; it makes things easily predictable, and building dynamically efficient queries becomes a joy).
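For example, if the only relationship types in play are ACTED_IN, DIRECTED and WROTE (as in your sample output), a minimal sketch using only degree statistics would be:

MATCH (n) WHERE id(n) IN [ id_0, id_1, id_2 ]
RETURN id(n) AS id,
       size((n)-[:ACTED_IN]-()) AS ACTED_IN,
       size((n)-[:DIRECTED]-()) AS DIRECTED,
       size((n)-[:WROTE]-()) AS WROTE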
But let's assume you don't know the schema. You can use the APOC Cypher procedures, which allow you to build dynamic queries.
The flow is:
Get all the relationship types from the database (fast)
Get the nodes from the id list (fast)
Build dynamic queries using size (fast)
CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL apoc.cypher.run("RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
RETURN id(n), type, value.count
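If you want each node's histogram on a single row (closer to your Map<Long, Map<String, Long>> structure), you can aggregate the same results, e.g. dropping the zero counts:

CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL apoc.cypher.run("RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
WITH id(n) AS id, type, value.count AS count
WHERE count > 0
RETURN id, collect({type: type, count: count}) AS histogram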
In brief: how can we MERGE multiple nodes and relationships the way we do with MATCH and CREATE? We can do multiple CREATEs or MATCHes for nodes or relationships, separated by commas, but this is not allowed with MERGE.
In detail: suppose I have two graphs:
G1: (a)-[r1]->(b)<-[r2]-(c)
G2: (a)-[r1]->(b)<-[r3]-(d)
I have G1 inserted in Neo4j, and G2 ready to push to the DB. The normal way to do it is to merge each node and then merge the relationship; in this example, the r1 relationship causes no change in the DB, since G1 already has it, but for the second one, my Cypher first creates node d and then adds relationship r3.
Is there a way to push G2 to db in one step? something like:
MERGE (a), (b), (c), (a)-[r1]->(b)<-[r3]-(d)
to create this result:
(a)-[r1]->(b)<-[r2]-(c)
           ^
           |
          [r3]
           |
          (d)
Not with a single MERGE statement. You would need to follow the pattern of doing a MERGE for each node, then a MERGE for each relationship.
That said, Neo4j does use transactions, so while this is broken into multiple clauses in your Cypher query, the transaction is applied atomically when committed.
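For example, a sketch for pushing G2, assuming hypothetical :Node labels with name properties to identify the nodes, and R1/R3 as the relationship types:

MERGE (a:Node {name: 'a'})
MERGE (b:Node {name: 'b'})
MERGE (d:Node {name: 'd'})
MERGE (a)-[:R1]->(b)
MERGE (d)-[:R3]->(b)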
I am trying to load 500,000 nodes, but the query is not executing successfully. Can anyone tell me the limit on the number of nodes in the Neo4j Community Edition database?
I am running this query:
result = session.run("""
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///relationships.csv" AS row
merge (s:Start {ac:row.START})
on create set s.START=row.START
merge (e:End {en:row.END})
on create set e.END=row.END
FOREACH (_ in CASE row.TYPE WHEN "PAID" then [1] else [] end |
MERGE (s)-[:PAID {cr:row.CREDIT}]->(e))
FOREACH (_ in CASE row.TYPE WHEN "UNPAID" then [1] else [] end |
MERGE (s)-[:UNPAID {db:row.DEBIT}]->(e))
RETURN s.START as index, count(e) as connections
order by connections desc
""")
I don't think the Community Edition is more limited than the Enterprise Edition in that regard, and most of the limits have been removed in 3.0.
Anyway, I can easily create a million nodes (in one transaction):
neo4j-sh (?)$ unwind range(1, 1000000) as i create (n:Node) return count(n);
+----------+
| count(n) |
+----------+
| 1000000 |
+----------+
1 row
Nodes created: 1000000
Labels added: 1000000
3495 ms
Running that 10 times, I've definitely created 10 million nodes:
neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 10000000 |
+----------+
1 row
3 ms
Your problem is most likely related to the size of the transaction: if it's too large, it can result in an OutOfMemory error, and before that it can slow the instance to a crawl because of all the garbage collection. Split the node creation in smaller batches, e.g. with USING PERIODIC COMMIT if you use LOAD CSV.
Update:
Your query already includes USING PERIODIC COMMIT and only creates 2 nodes and 1 relationship per line from the CSV file, so it most likely has more to do with the performance of the query itself than with the size of the transaction.
You have Start nodes with 2 properties set to the same value from the CSV (ac and START), and End nodes also with 2 properties set to the same value (en and END). Is there a uniqueness constraint on the property used for the MERGE? Without one, as nodes are created, processing each line will take longer and longer, as it needs to scan all the existing nodes with the wanted label (an O(n^2) algorithm overall, which is pretty bad for 500K nodes).
CREATE CONSTRAINT ON (n:Start) ASSERT n.ac IS UNIQUE;
CREATE CONSTRAINT ON (n:End) ASSERT n.en IS UNIQUE;
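(On Neo4j 4.4 and later, the equivalent syntax is CREATE CONSTRAINT FOR (n:Start) REQUIRE n.ac IS UNIQUE.)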
That's probably the main improvement to apply.
However, do you actually need to MERGE the relationships (instead of CREATE)? Either the CSV contains a snapshot of the current credit relationships between all Start and End nodes (in which case there's a single relationship per pair), or it contains all transactions and there's no real reason to merge those for the same amount.
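If not, a sketch of the CREATE variant, replacing the corresponding FOREACH lines in your import (s, e and row are bound as in your query):

FOREACH (_ in CASE row.TYPE WHEN "PAID" then [1] else [] end |
  CREATE (s)-[:PAID {cr:row.CREDIT}]->(e))
FOREACH (_ in CASE row.TYPE WHEN "UNPAID" then [1] else [] end |
  CREATE (s)-[:UNPAID {db:row.DEBIT}]->(e))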
Finally, do you actually need to report the sorted, aggregated result from that loading query? It requires more memory and could be split into a separate query, after the loading has succeeded.
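For example, a minimal sketch of the reporting part as a separate query, run after the load has succeeded:

MATCH (s:Start)-[r]->(e:End)
RETURN s.START as index, count(e) as connections
ORDER BY connections DESC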
I would like to query for various things and return a combined set of relationships. In the example below, I want to return all people named Joe living on Main St. I want to return both the has_address and has_state relationships.
MATCH (p:Person),
(p)-[r:has_address]-(a:Address),
(a)-[r1:has_state]-(s:State)
WHERE p.name =~ ".*Joe.*" AND a.street =~ ".*Main St.*"
RETURN r, r1;
But when I run this query in the Neo4j Browser and look under the "Text" view, it puts r and r1 as columns in a table (something like this):
│r  │r1 │
╞═══╪═══╡
│{} │{} │
rather than as desired with each relationship on a different row, like:
Joe Smith | has_address | 1 Main Street
1 Main Street | has_state | NY
Joe Richards | has_address | 22 Main Street
I want to download this as a CSV file for filtering elsewhere. How do I re-write the query in Neo4J to get the desired result?
You may want to look at the Cypher cheat sheet, specifically the Relationship Functions.
That said, you have variables on all the nodes you need. You can output all the data you need on each row.
MATCH (p:Person),
(p)-[r:has_address]-(a:Address),
(a)-[r1:has_state]-(s:State)
WHERE p.name =~ ".*Joe.*" AND a.street =~ ".*Main St.*"
RETURN p.name AS name, a.street AS address, s.name AS state
That should be enough.
What you seem to be asking for above is a way to union r and r1, but in such a way that they alternate in-order, one row being r and the next being its corresponding r1. This is a rather atypical kind of query, and as such there isn't a lot of support for easily making this kind of output.
If you don't mind rows being out of order, it's easy to do, but your start and end nodes for each relationship are no longer the same type of thing.
MATCH (p:Person),
(p)-[r:has_address]-(a:Address),
(a)-[r1:has_state]-(s:State)
WHERE p.name =~ ".*Joe.*" AND a.street =~ ".*Main St.*"
WITH COLLECT(r) + COLLECT(r1) as rels
UNWIND rels AS rel
RETURN startNode(rel) AS start, type(rel) AS type, endNode(rel) as end
I'm hoping this diagram will be sufficient to explain what I'm after:
              true
       a--------------------b
       |                    |
parent |                    | parent
       |                    |
      a_e------------------b_e
            experimental
Nodes a_e and b_e are experimental observations that each have only one parent, a and b, respectively. I know a true relationship exists between a and b, and I want to find cases where experimental relationships were observed between a_e and b_e. Among other things, I tried the following:
MATCH (n)-[:true]-(m)
WITH n,m
MATCH (n)-[:parent]-(i)
MATCH (m)-[:parent]-(j)
WITH i,j
OPTIONAL MATCH (i)-[r]-(j)
RETURN r
but this returns no rows. I'm thinking of this like a nested loop, matching all possible relationships between all i's and all j's. Is this type of query possible?
Something like
match (n)-[:true]-(m)
match (n)-[:parent]->(n_child)-[:experimental]-(m_child)<-[:parent]-(m)
return n_child, m_child
(not tested)
Assuming this is an example and you have labels etc. on your nodes.
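For example, a sketch with hypothetical :Observation labels on the experimental nodes:

match (n)-[:true]-(m)
match (n)-[:parent]->(n_child:Observation)-[:experimental]-(m_child:Observation)<-[:parent]-(m)
return n_child, m_child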