how to remove property duplicates inside Neo4j - neo4j

I'm new at Neo4j and i'm trying to learn it in order to start a project for my thesis, i spent a lot of time to manually reorganize a large dataset from xml to csv, then i imported a piece of this db inside Neo4j, i would avoid those duplicates but i canno't find a proper query, can you help me?
Thanks!
Please tell me if you need to know anything about this database or other stuff to undestand better the situation.

Here is one way to ensure that the dedicateePersonId list has distinct values:
MATCH (p:Person)
UNWIND p.dedicateePersonId AS id
WITH p, COLLECT(DISTINCT id) AS ids
SET p.dedicateePersonId = ids
[UPDATE]
The above approach is sufficient if there was only one property that needs adjustment but very inappropriate (too complex and inefficient) for multiple properties.
Here is a much better approach (using the functions apoc.create.setProperty and apoc.coll.toSet) to update the values of any number of property keys. keys is assumed to be parameter containing the names of the properties to be updated.
MATCH (p:Person)
UNWIND $keys AS k
CALL apoc.create.setProperty(p, k, apoc.coll.toSet(p[k])) YIELD node
RETURN DISTINCT node

Related

Cypher: Avoid cycle in multiple relationship paths

My website has pages dedicated to some events, represented by nodes in neo4j. Those events possess sub-events which are relationships under neo4j, and which correspond to links on the source page to the target page. I currently have a search engine that highlights the links to the searched events, but it is flawed by cycles in the data model. Indeed it is highlighting all the links that contain cyclic references to the current page if this page contains any link to the searched event.
The aim is therefore to have a query that is able to flag the nodes and relationships which are related to the searched events, without flagging a path only because of cyclic relationships.
I've created a small dataset representative of the issue that you can build using this query:
CREATE
(r:Event:Searched {name:'R', tag:1}),
(d:Event:Searched {name:'D', tag:1}),
(o:Event {name:'O'}),
(a:Event {name:'A'}),
(b:Event {name:'B'}),
(c:Event {name:'C'}),
(e:Event {name:'E'}),
(o)-[:hasEvent]->(a),
(o)-[:hasEvent]->(e),
(o)-[:hasEvent]->(r),
(o)-[:hasEvent]->(c),
(a)-[:hasEvent]->(b),
(b)-[:hasEvent]->(o),
(c)-[:hasEvent]->(d)
Which produces the following graph:
My aim is to have a query that only fetches nodes C and O, but not A or B, as the only reason they are flagged is that O is already flagged:
.
My current query that I need to fix is the following:
MATCH path=(upper:Event)-[:hasEvent*]->(source:Event:Searched)
RETURN upper
I hope you can help me, I couldn't manage to make similar questions' answers work on my specific case.
Ideally, the solution shouldn't be too computing-intensive, as my real model is quite big (2.300.000 nodes and 9.500.000 relationships), and the current indexing in the search engine is already quite slow.
Thank you in advance for your help
You could try out apoc.path.expandConfig procedure. It has a uniqueness property, which you can be configured that a node cannot be traversed more than once.
MATCH (n:Event)
CALL apoc.path.expandConfig(n, {
relationshipFilter: "hasEvent>",
labelFilter: "/Searched",
uniqueness: "NODE_GLOBAL"
}) YIELD path
RETURN [n IN nodes(path) WHERE NOT n:Searched] AS upper
However, this query will still return A and B, because with MATCH (n:Event) you start looking from every node (regardless of the relationship between O and A). But if I understood you correctly, you don't wan't to start from all nodes, but from a specific one ("pages dedicated to some events"). So you might want to start with, e.g., MATCH (n:Event {name: "O"}) that will return only O and C.

How to move all neo4j relationships with all their labels and properties from one node to another?

Suppose you've got two nodes that represent the same thing, and you want to merge those two nodes. Both nodes can have any number of relations with other nodes.
The basics are fairly easy, and would look something like this:
MATCH (a), (b) WHERE a.id == b.id
MATCH (b)-[r]->()
CREATE (a)-[s]->()
SET s = PROPERTIES(r)
DELETE DETACH b
Only I can't create a relation without a type. And Cypher doesn't support variable labels either. I'd love to be able to do something like
CREATE (a)-[s:{LABELS(r)}]->(o)
but that doesn't work. To create the relation, you need to know the type of the relation, and in this case I really don't.
Is there a way to dynamically assign types to relationships, or am I going to have to query the types of the old relation, and then string concat new queries with the proper types? That's not impossible, but a lot slower and more complex. And this could potentially match a lot of elements and even more relationships, so having to generate a separate query for every instance is going to slow things down quite a lot.
Or is there a way to change the target of the old relationship? That would probably be the fastest, but I'm not aware of any way to do that.
I think you need to take a look at APOC, especially apoc.create.relationship which enable creating relationships with dynamic type.
Adapting your example, you should end up with something along the line of (not tested):
MATCH (a), (b) WHERE a.id == b.id
MATCH (b)-[r]->(n)
CALL apoc.create.relationship(a, type(r), properties(r), n)
DETACH DELETE b
NB
relationships have TYPE and not label
the proper cypher statement to delete relationships attached to a node and the node itself is DETACH DELETE (and not DELETE DETACH)
Related resource: https://markhneedham.com/blog/2016/10/30/neo4j-create-dynamic-relationship-type/
The APOC procedure apoc.refactor.mergeNodes should be very helpful. That procedure is very powerful, and you need to read the documentation to understand how to configure it to do what you want in your specific situation.
Here is a simple example that shows how to use the procedure's default configuration to merge nodes with the same id:
MATCH (node:Foo)
WITH node.id AS id, COLLECT(node) AS nodes
WHERE SIZE(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {}) YIELD node
RETURN node
In this example, I specified an arbitrary Foo label to avoid accidentally merging unwanted nodes. Doing so also helps to speed up the query if you have a lot of nodes with other labels (since they will not need to be scanned for the id property).
The aggregating function COLLECT is used to collect a list of all the nodes with the same id. After checking the size of the list, it is passed to the procedure.

Back-reference neo4j

Can I use some back-reference sort of mechanism in neo4j? I'm not interested in what matches the query, just that it is the same thing in many places. Something like:
MATCH (a:Event {diagnosis1:11})
MATCH (b:Event {diagnosis1:15})
MATCH (c:Event {diagnosis1:5})
MATCH (a)-[rel:Next {PatientID:*}]->(b)
MATCH (b)-[rel1:Next {PatientID:\{1}]->(c)
The idea is that I just require the attribute IDs from both edges to be the same, without specifying it. The whole purpose of it would be not generating all possible matchings, to then filter them, but only hop in the specific places.
I've asked something similar in a more specific way here.
Edit: I know WHERE clauses can be used for that, but they filter the query AFTER matching the edges and nodes. I want to do that DURING the matching!
Use a WHERE clause with simple references, there's no need for back references:
MATCH (a)-[rel:Next]->(b)
MATCH (b)-[rel1:Next]->(c)
WHERE rel.PatientID = rel1.PatientID
Update
First of all, Cypher is a declarative query language: you express what you want, the runtime takes care of executing and optimizing it any way it can, so it's not that obvious that it would do it the way you think it will, or that using "back references" would magically solve the problem; it's just another way of writing the same thing.
So, your problem is that the match creates all the relationship pairs before filtering them. How about splitting the match in 2 phases using WITH?
MATCH (a:Event {diagnosis1:11})-[rel:Next]->(b:Event {diagnosis1:15})
WITH a, b, rel
MATCH (b)-[rel1:Next]->(c:Event {diagnosis1:5})
WHERE rel1.PatientID = rel.PatientID
That should only select the second relationships that match the first, but I'm not sure if it's an O(n^2) algorithm in Cypher's runtime.
Otherwise, if you drop to the Java API (which would mean either an extension or a procedure, depending on your version of Neo4j), you can probably implement in O(n) by
scanning all the relationships between a and b, indexing them by PatientID in some multimap (see Guava, or use a Map<K, Collection<V>>); this is O(n)
then doing the same for all the relationships between b and c, still O(n)
iterate on the keys of one multimap to get the values in both and match them, still O(n)

Cypher: Ordering Nodes on Same Level by Property on Relationship

I am new to Neo4j and currently playing with this tree structure:
The numbers in the yellow boxes are a property named order on the relationship CHILD_OF.
My goal was
a) to manage the sorting order of nodes at the same level through this property rather than through directed relationships (like e.g. LEFT, RIGHT or IS_NEXT_SIBLING, etc.).
b) being able to use plain integers instead of complete paths for the order property (i.e. not maintaining sth. like 0001.0001.0002).
I can't however find the right hint on how or if it is possible to recursively query the graph so that it keeps returning the nodes depth-first but for the sorting at each level consider the order property on the relationship.
I expect that if it is possible it might include matching the complete path iterating over it with the collection utilities of Cypher, but I am not even close enough to post some good starting point.
Question
What I'd expect from answers to this question is not necessarily a solution, but a hint on whether this is a bad approach that would perform badly anyways. In terms of Cypher I am interested if there is a practical solution to this.
I have a general idea on how I would tackle it as a Neo4j server plugin with the Java traversal or core api (which doesn't mean that it would perform well, but that's another topic), so this question really targets the design and Cypher aspect.
This might work:
match path = (n:Root {id:9})-[:CHILD_OF*]->(m)
WITH path, extract(r in rels(path) | r.order) as orders
ORDER BY orders
if it complains about sorting arrays then computing a number where each digit (or two digits) are your order and order by that number
match path = (n:Root {id:9})-[:CHILD_OF*]->(m)
WITH path, reduce(a=1, r in rels(path) | a*10+r.order) as orders
ORDER BY orders

Cypher query optimisation - Utilising known properties of nodes

Setup:
Neo4j and Cypher version 2.2.0.
I'm querying Neo4j as an in-memory instance in Eclipse created TestGraphDatabaseFactory().newImpermanentDatabase();.
I'm using this approach as it seems faster than the embedded version and I assume it has the same functionality.
My graph database is randomly generated programmatically with varying numbers of nodes.
Background:
I generate cypher queries automatically. These queries are used to try and identify a single 'target' node. I can limit the possible matches of the queries by using known 'node' properties. I only use a 'name' property in this case. If there is a known name for a node, I can use it to find the node id and use this in the start clause. As well as known names, I also know (for some nodes) if there are names known not to belong to a node. I specify this in the where clause.
The sorts of queries that I am running look like this...
START
nvari = node(5)
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvari:C4)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION),
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION),
WHERE
NOT(nvarj.Name IN ['nf']) AND NOT(nvarm.Name IN ['nb','nj'])
RETURN DISTINCT target
Another way to think about this (if it helps), is that this is an isomorphism testing problem where we have some information about how nodes in a query and target graph correspond to each other based on restrictions on labels.
Question:
With regards to optimisation:
Would it help to include relation variables in the match clause? I took them out because the node variables are sufficient to distinguish between relationships but this might slow it down?
Should I restructure the match clause to have match/where couples including the where clauses from my previous example first? My expectation is that they can limit possible bindings early on. For example...
START
nvari = node(5)
MATCH
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarj.Name IN ['nf'])
MATCH
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarm.Name IN ['nb','nj'])
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION)
RETURN DISTINCT target
On the side:
(Less important but still an interest) If I make each relationship in a match clause an optional match except for relationships containing the target node, would cypher essentially be finding a maximum common sub-graph between the query and the graph data base with the constraint that the MCS contains the target node?
Thanks a lot in advance! I hope I have made my requirements clear but I appreciate that this is not a typical use-case for Neo4j.
I think querying with node properties is almost always preferable to using relationship properties (if you had a choice), as that opens up the possibility that indexing can help speed up the query.
As an aside, I would avoid using the IN operator if the collection of possible values only has a single element. For example, this snippet: NOT(nvarj.Name IN ['nf']), should be (nvarj.Name <> 'nf'). The current versions of Cypher might not use an index for the IN operator.
Restructuring a query to eliminate undesirable bindings earlier is exactly what you should be doing.
First of all, you would need to keep using MATCH for at least the first relationship in your query (which binds target), or else your result would contain a lot of null rows -- not very useful.
But, thinking clearly about this, if all the other relationships were placed in separate OPTIONAl MATCH clauses, you'd be essentially saying that you want a match even if none of the optional matches succeeded. Therefore, the logical equivalent would be:
MATCH (target:C5)-[:IN_LOCATION]->(nvara:LOCATION)
RETURN DISTINCT target
I don't think this is a useful result.

Resources