How to create an ordered chain linked to a node? - neo4j

I have a set of HeadNodes, each with an id field, and a set of TailNodes, each with an id field and a date field in milliseconds. The TailNodes are not related to each other or to any HeadNode.
I want to write a query that takes the TailNodes that are not yet joined to a HeadNode, either directly or through other TailNodes:
Match (p: TailNodes) where not (p)-[:RELATED_TO]->()
For each such node, the query should take its id and look up the HeadNode with the same id (it is guaranteed to exist), then find the right place to insert the node in the chain, ordered by datetime.
For example:
we have 1 HeadNode{id: 1} and 3 TailNodes: {id: 1, datetime: 111}, {id: 1, datetime: 115}, and {id: 1, datetime: 113} without any relationships.
In the first step, it takes the first TailNode {id: 1, datetime: 111} and creates a relationship:
(head:HeadNode{id: 1})<-[:RELATED_TO]-(tail:TailNodes{id:1, datetime:111})
In the second step, it takes the second TailNode and finds that 115 is greater than 111, so it deletes the previous relationship and creates two new ones, producing a chain that looks like this:
(head:HeadNode{id: 1})<-[:RELATED_TO]-(tail1:TailNodes{id:1, datetime:115})<-[:RELATED_TO]-(tail2:TailNodes{id:1, datetime:111})
In the third step, it finds that 113 is greater than 111 but less than 115, so it deletes the relationship between datetime:115 and datetime:111 and creates two new relationships, finally giving the following:
(head:HeadNode{id: 1})<-[:RELATED_TO]-(tail1:TailNodes{id:1, datetime:115})<-[:RELATED_TO]-(tail2:TailNodes{id:1, datetime:113})<-[:RELATED_TO]-(tail3:TailNodes{id:1, datetime:111})
I hope the explanation was clear. Thanks in advance.

OK, here is a first cut. I'm out of time to create a more robust example, but I will take another shot later.
I started with a case where there were already nodes in the list:
H<--(T {dt:112})<--(T {dt:114})
I realize I create these in ascending order rather than descending order, too.
// match the orphaned tail nodes floating around
match (p:Tail)
where not (p)-->()
with p
// match the strand with the same name and the tail nodes that are connected
// where one datetime (dt) is greater and one is less than my orphaned tail nodes
match (t1:Tail)<-[r:RELATED_TO]-(t2:Tail)
where t1.name = p.name
and t2.name = p.name
and t1.dt < p.dt
and t2.dt > p.dt
// break the relationship between the two nodes i want to insert between
delete r
// create new relationships from the orphan to the two previously connected tails
with t1, t2, p
create (p)-[:RELATED_TO]->(t1)
create (t2)-[:RELATED_TO]->(p)
return *
The case just needs to be extended for a tailless head and for an orphan with a datetime greater than the last tail (i.e., not in between two existing tails).
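To make the three cases concrete (empty chain, insert between two tails, insert at either end), here is a minimal in-memory sketch in Python rather than Cypher. It is only a model of the ordering logic from the question, with the newest datetime placed next to the head. The function names insert_into_chain and relationships are hypothetical and are not part of any Neo4j API.

```python
def insert_into_chain(chain, dt):
    """Insert dt into chain, which is ordered from the node next to the
    head outward, i.e. in descending datetime order."""
    pos = 0
    # Walk past every tail that is newer than the one being inserted.
    while pos < len(chain) and chain[pos] > dt:
        pos += 1
    return chain[:pos] + [dt] + chain[pos:]


def relationships(head, chain):
    """Return (source, target) pairs for RELATED_TO edges pointing toward head."""
    nodes = [head] + chain
    return [(nodes[i + 1], nodes[i]) for i in range(len(nodes) - 1)]


chain = []
for dt in (111, 115, 113):  # arrival order from the question's example
    chain = insert_into_chain(chain, dt)

print(chain)
print(relationships("head", chain))
```

In a graph, each insertion corresponds to deleting at most one existing relationship and creating at most two new ones, which is what the Cypher above does for the in-between case.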

Related

Cypher not able to create edges in loop

I want to write a query that upserts (updates or inserts) one node and connects it to related nodes.
I have some params, for example:
:params {namedEntities: [{"value": "value1","category": "category1"}, {"value": "value2", "category":"category2"}],
namedEntityAmounts: [2,3],
articleParams: {"id": "1","title": "3333"}
}
Using these params, I want to run a query like this:
MATCH (oldA:Article {id: $articleParams.id})
DETACH DELETE oldA
CREATE (a:Article)
SET a=properties($articleParams)
WITH a
UNWIND range(0, size($namedEntities)-1) as i
WITH $namedEntities[i] as nes, $namedEntityAmounts[i] as amount, a
MATCH (ne:NamedEntity {value: nes.value, category: nes.category})
CREATE (a)<-[r:OCCURS {amount: amount}]-(ne)
RETURN a,ne, type(r)
The first four lines of this query work. If the query finds an article with the given id, that article is detached and deleted, and a new node is then created with the new params.
However, I struggle with the next part, where I want to find all NamedEntity nodes, which are identified by (value + category), and connect the newly created Article node to all of them. The query seems to ignore the last 3 lines and just prints:
Added 1 label, created 1 node, deleted 1 node, set 3 properties, completed after 6ms.
What am I doing wrong here?

neo4j - restrict query based on node's rank

I have a hierarchical structure of nodes, which all have a custom-assigned sorting property (numeric). Here's a simple Cypher query to recreate:
merge (p {my_id: 1})-[:HAS_CHILD]->(c1 { my_id: 11, sort: 100})
merge (p)-[:HAS_CHILD]->(c2 { my_id: 12, sort: 200 })
merge (p)-[:HAS_CHILD]->(c3 { my_id: 13, sort: 300 })
merge (c1)-[:HAS_CHILD]->(cc1 { my_id: 111 })
merge (c2)-[:HAS_CHILD]->(cc2 { my_id: 121 })
merge (c3)-[:HAS_CHILD]->(cc3 { my_id: 131 });
The problem I'm struggling with is that I often need to make decisions based on a child node's rank relative to some parent node, with regard to this sort identifier. So, for example, node c1 has rank 1 relative to node p (because it has the smallest sort property), c2 has rank 2, and c3 has rank 3 (the largest sort).
The kind of decision I need to make based on this information: display the children of only the first 2 cX nodes. Here's what I want to get:
cc1 and cc2 are present, but cc3 is not because c3 (its parent) is not the first or the second child of p. Here's a dumb query for that:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc) where c.sort <= 200
return p, c, cc
The problem is, these sort properties are custom-set and imported, so I have no way of knowing which value will be held for child number 2.
My current solution is to rank it during import, and since I'm using Oracle, that's quite simple: I just need to use the rank window function. But it seems awkward to me, and I feel there could be a more elegant solution. I tried the following query, and it works, but it looks weird and is quite slow on bigger graphs:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc)
where size([ (p)-->(c1) where c1.sort < c.sort |c1]) < 2
return p, c, cc
Here's the plan for this query and the most expensive part is in fact the size expression:
The slowness you're seeing is likely because you're not performing an index lookup in your query, so it's performing an all nodes scan and accessing the my_id property of every node in your graph to find the one with id 1 (your p node).
You need to add labels on your nodes and use these labels in your queries (at least for your p node), and create an index (or in this case, probably a unique constraint) on the label for my_id so this lookup becomes fast.
You can confirm what's going on by doing a PROFILE of your query (if you can add the profile plan to your description, with all elements of the plan expanded that would help determine further optimizations).
As for your query, something like this should work (I'm using a :Node label as a stand-in for your actual label):
match (p:Node {my_id: 1 })-->(c)
with p, c
order by c.sort asc
with p, collect(c) as children // children are in order
unwind children[..2] as child // one row for each of the first 2 children
optional match (child)-->(cc) // only matched for the first 2 children
return p, children, collect(cc) as grandchildren
Note that this only returns nodes, not paths or relationships. The reason you're getting the result graph in the graphical view is that, in the Browser Settings tab (the gear icon in the lower left menu), you have Connect result nodes checked at the bottom.
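The order, collect, slice, then expand pattern in that query can be illustrated with a small Python sketch. This is only an in-memory model of the logic, not Cypher: the kids field and the function name grandchildren_of_first are made up for the illustration, while the my_id and sort values come from the question's sample data.

```python
def grandchildren_of_first(children, n=2):
    """Rank children by their sort value and return the grandchildren
    of only the first n of them."""
    ranked = sorted(children, key=lambda c: c["sort"])  # ORDER BY c.sort
    result = []
    for child in ranked[:n]:          # children[..2]
        result.extend(child["kids"])  # OPTIONAL MATCH (child)-->(cc)
    return result


# Sample data from the question, deliberately listed out of order.
children = [
    {"my_id": 13, "sort": 300, "kids": [131]},
    {"my_id": 11, "sort": 100, "kids": [111]},
    {"my_id": 12, "sort": 200, "kids": [121]},
]

print(grandchildren_of_first(children))
```

The point of the sketch is that the rank never has to be stored: sorting once per parent and slicing the list gives the first two children regardless of the actual sort values.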

How to create relationship based on common Epochtime property

I am trying to model the state changes of a batch. I capture the various changes and have an Epoch time column to track them. I managed to get this done with the code below:
MATCH(n:Batch), (n2:Batch)
WHERE n.BatchId = n2.BatchId
WITH n, n2 ORDER BY n2.Name
WITH n, COLLECT(n2) as others
WITH n, others, COALESCE(
HEAD(FILTER(x IN others where x.EpochTime > n.EpochTime)),
HEAD(others)
) as next
CREATE (n)-[:NEXT]->(next)
RETURN n, next;
It makes my graph circular because of the HEAD(others) fallback, and it doesn't stop at the node with the maximum Epoch time. If I remove HEAD(others), I am unable to figure out how to stop the creation of a relationship for the last node. I'm not sure how to put conditions around the relationship creation so that it stops when the next node is null.
This might do what you want:
MATCH(n:Batch)
WITH n ORDER BY n.EpochTime
WITH n.BatchId AS id, COLLECT(n) AS ns
CALL apoc.nodes.link(ns, 'NEXT')
RETURN id, ns;
It orders all the Batch nodes by EpochTime, and then collects all the Batch nodes with the same BatchId value. For each collection, it calls the apoc procedure apoc.nodes.link to link all its nodes together (in chronological order) with NEXT relationships. Finally, it returns each distinct BatchId and its ordered collection of Batch nodes.
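As a rough illustration of why this approach does not produce a cycle, here is a small Python model of the same idea: sort by EpochTime, group by BatchId, and link each node only to its successor. It does not call APOC. The sample data and the function name link_batches are invented for the sketch; BatchId, EpochTime and Name are the property names used in the question.

```python
from collections import defaultdict


def link_batches(batches):
    """Return NEXT links as (from_name, to_name) pairs, one chain per BatchId."""
    groups = defaultdict(list)
    # Sort first, then collect, so each group stays in chronological order.
    for b in sorted(batches, key=lambda b: b["EpochTime"]):
        groups[b["BatchId"]].append(b["Name"])
    links = []
    for names in groups.values():
        # zip stops one short of the end, so the last node gets no NEXT link.
        links.extend(zip(names, names[1:]))
    return links


batches = [
    {"BatchId": 1, "Name": "A", "EpochTime": 10},
    {"BatchId": 1, "Name": "B", "EpochTime": 30},
    {"BatchId": 1, "Name": "C", "EpochTime": 20},
    {"BatchId": 2, "Name": "D", "EpochTime": 5},
]

print(link_batches(batches))
```

Pairing each element with its successor yields one link fewer than the number of nodes in a group, so the node with the maximum EpochTime is never given an outgoing link and a single-node group produces no links at all.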

Create relationships from sequence of events

I have a CSV with a log of events that has the following columns: EventType, UserId, RecordId (an auto-incremented sequence number). I want to import it into Neo4j, build a node for every EventType (around 100 unique types), and then analyze paths using relationships. To build the relationships I need to match all raw events and find the "next" event in the path, which means matching each event with the event that has the same UserId and the next larger RecordId (next RecordId > current RecordId).
What is an efficient way to do this in Cypher? So far I have only come up with queries that involve a Cartesian product, which are very slow.
I think you cannot avoid Cartesian products in this case. However, you can (1) make them as small as possible and (2) use indexing to improve the speed of your queries.
Besides using EventType as a node label ("unique type"), I strongly recommend using an additional Event label for all events so that you can index the userId and recordId values.
CREATE INDEX ON :Event(recordId)
CREATE INDEX ON :Event(userId)
I created a small example data set:
CREATE
(e1:Event:Skating {userId: 1, recordId: 1}),
(e2:Event:Hiking {userId: 1, recordId: 2}),
(e3:Event:Mountaineering {userId: 1, recordId: 3})
To get the next recordId, you need to satisfy nextRecordId > currentRecordId, and the nextRecordId must also be the smallest such value (as the recordId comes from an auto-incremented sequence). We then connect the two events using MERGE (CREATE also works, but using MERGE makes sure that we avoid creating duplicate edges). This gives the following query:
MATCH (a:Event), (b:Event)
WHERE a.userId = b.userId
AND a.recordId < b.recordId
WITH a, min(b.recordId) AS bRecordId
MATCH (b:Event {recordId: bRecordId})
MERGE (a)-[:NEXT]->(b)
This query creates a Cartesian product for all user ids. As long as the users do not participate in hundreds of events, the size of the Cartesian products should not grow huge. Note that the first MATCH uses both indices (userId and recordId), while the second MATCH uses the index on recordId.
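For comparison, the "smallest larger recordId for the same user" rule can also be computed by grouping events per user and sorting each group, which avoids comparing every pair. The Python sketch below is a different technique from the Cypher query above and is only an in-memory model; the function name next_pairs and the sample events are assumptions made for the illustration.

```python
from collections import defaultdict


def next_pairs(events):
    """events: iterable of (user_id, record_id) tuples.
    Returns (record_id, next_record_id) pairs within each user."""
    by_user = defaultdict(list)
    for user_id, record_id in events:
        by_user[user_id].append(record_id)
    pairs = []
    for user_id in sorted(by_user):
        ids = sorted(by_user[user_id])
        # After sorting, the next event is simply the following element.
        pairs.extend(zip(ids, ids[1:]))
    return pairs


events = [(1, 1), (2, 4), (1, 2), (1, 3), (2, 6)]
print(next_pairs(events))
```

Once a user's events are sorted, each event's successor is just the following element, so the work per user is dominated by the sort instead of growing with the square of the number of events.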

Error creating relationships over huge dataset

My question is similar to the one pointed here :
Creating unique node and relationship NEO4J over huge dataset
I have 2 tables, Entity (Entities.txt) and Relationships (EntitiesRelationships_Updated.txt), which look like the examples below. Both tables are inside an import folder within the Neo4j database. What I am trying to do is load the tables using the load csv command and then create relationships.
As shown in the table below, if ParentID is 0, it means that ENT_ID does not have a parent. If it is populated, then it has a parent. For example, in the table below, ENT_ID = 3 is the parent of ENT_ID = 4 and ENT_ID = 1 is the parent of ENT_ID = 2.
**Entity Table**
ENT_ID  Name  PARENTID
1       ABC   0
2       DEF   1
3       GHI   0
4       JKG   3
**Relationship Table**
RID  ENT_IDPARENT  ENT_IDCHILD
1    1             2
2    3             5
The Entity table has 2 million records and the Relationship table has about 400K lines.
Each RID has a particular tag associated with it. For example RID = 1 has it that the relation is A FATHER_OF B; RID = 2 has it that the relation is A MOTHER_OF B. Similarly there are 20 such RIDs associated.
Both of these are in txt format.
My first step is to load the entity table. I used the following script:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///Entities.txt" AS Entity FIELDTERMINATOR '|'
CREATE (n:Entity{ENT_ID: toInt(Entity.ENT_ID),NAME: Entity.NAME,PARENTID: toInt(Entity.PARENTID)})
This query works fine. It takes about 10 minutes to load 2.8mil records. The next step I do is to index the records:
CREATE INDEX ON :Entity(PARENTID)
CREATE INDEX ON :Entity(ENT_ID)
This query runs fine as well. Following this I tried creating the relationships from the relationship table using a similar query as in the above link:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
MATCH (n:A {ENT_IDPARENT : Rships.ENT_IDPARENT})
with Entity, n
MATCH (m:B {ENT_IDCHILD : Rships.ENT_IDCHILD})
with m,n
MERGE (n)-[r:RELATION_OF]->(m);
When I run this, the query keeps running for about an hour and then stops at a particular size (in my case 2.2gb). I based this query on the link above. It includes the edit from the solution below and still does not work.
I have one more query, shown below (also based on the above link). I run it because I want to create a relationship based on the Entity table:
PROFILE
MATCH(Entity)
MATCH (a:Entity {ENT_ID : Entity.ENT_ID})
WITH Entity, a
MATCH (b:Entity {PARENTID : Entity.PARENTID})
WITH a,b
MERGE (a)-[r:PARENT_OF]->(b)
When I tried running this query, I got a Java Heap Space error. Unfortunately, I have not been able to find a solution for either problem.
Could you please advise if I am doing something wrong?
This query allows you to take advantage of your :Entity(ENT_ID) index:
MATCH (child:Entity)
WHERE child.PARENTID > 0
WITH child.PARENTID AS pid, child
MATCH (parent:Entity {ENT_ID : pid})
MERGE (parent)-[:PARENT_OF]->(child);
Cypher does not use indices when the property value comes from another node. To get around that, the above query uses a WITH clause to represent child.PARENTID as a variable (pid). The time complexity of this query should be O(N). Your original query has a complexity of O(N * N).
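The difference between the two complexities can be pictured with a small Python sketch, in which a dictionary keyed by ENT_ID plays a role loosely analogous to the :Entity(ENT_ID) index. This is an in-memory model only and says nothing about how Neo4j implements index lookups; the function name parent_links is hypothetical, and the rows are the four sample rows from the question's Entity table.

```python
def parent_links(entities):
    """Return (parent_id, child_id) pairs using one lookup per child."""
    # Build the lookup once: O(N).
    index = {e["ENT_ID"]: e for e in entities}
    links = []
    for child in entities:  # one pass over the children: O(N)
        pid = child["PARENTID"]
        if pid > 0 and pid in index:  # constant-time lookup on average
            links.append((pid, child["ENT_ID"]))
    return links


entities = [
    {"ENT_ID": 1, "NAME": "ABC", "PARENTID": 0},
    {"ENT_ID": 2, "NAME": "DEF", "PARENTID": 1},
    {"ENT_ID": 3, "NAME": "GHI", "PARENTID": 0},
    {"ENT_ID": 4, "NAME": "JKG", "PARENTID": 3},
]

print(parent_links(entities))
```

Without the lookup table, finding each child's parent would mean scanning all entities again for every child, which is the O(N * N) behaviour described above.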
[EDITED]
If the above query takes too long or encounters errors that might be related to running out of memory, try this variant, which creates 1000 new relationships at a time. You can change 1000 to any number that is workable for you.
MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT 1000
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);
The WHERE clause filters out child nodes that already have a parent relationship. And the MERGE operation has been changed to a simpler CREATE operation, since we have already ascertained that the relationship does not yet exist. The query returns a count of the number of relationships created. If the result is less than 1000, then all parent relationships have been created.
Finally, to make the repeated queries automated, you can install the APOC plugin on the neo4j server and use the apoc.periodic.commit procedure, which will repeatedly invoke a query until it returns 0. In this example, I use a limit parameter of 10000:
CALL apoc.periodic.commit(
"MATCH (child:Entity)
WHERE child.PARENTID > 0 AND NOT ()-[:PARENT_OF]->(child)
WITH child.PARENTID AS pid, child
LIMIT {limit}
MATCH (parent:Entity {ENT_ID : pid})
CREATE (parent)-[:PARENT_OF]->(child)
RETURN COUNT(*);",
{limit: 10000});
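The repeat-until-zero behaviour described above can be sketched in plain Python. This models only the control flow (process at most limit items per batch, stop when a batch reports 0) and does not call APOC or Neo4j; link_batch and run_until_done are hypothetical names, and the data is invented for the example.

```python
def link_batch(entities, linked, limit):
    """Link up to `limit` children that have a parent and no link yet.
    Returns the number of links created in this batch."""
    pending = [
        e for e in entities
        if e["PARENTID"] > 0 and e["ENT_ID"] not in linked
    ][:limit]
    for e in pending:
        linked[e["ENT_ID"]] = e["PARENTID"]  # child id -> parent id
    return len(pending)


def run_until_done(entities, limit):
    """Call link_batch repeatedly until it reports that nothing was done."""
    linked = {}
    batches = 0
    while link_batch(entities, linked, limit) > 0:
        batches += 1
    return linked, batches


entities = [{"ENT_ID": 1, "PARENTID": 0}] + [
    {"ENT_ID": i, "PARENTID": 1} for i in range(2, 6)
]

linked, batches = run_until_done(entities, limit=2)
print(linked, batches)
```

Because each batch only selects children that are not yet linked, rerunning the loop after an interruption simply continues where it left off, which is the same property the NOT ()-[:PARENT_OF]->(child) filter gives the Cypher version.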
Your entity creation Cypher looks fine, as do your indexes.
I am rather confused about the last two Cypher fragments though.
Since your relationships have a specific label or id associated with them, it's probably best to add your relationships by loading from the relationship table data. However, the node labels in your query (A and B) aren't used in your Entity creation and aren't in your graph, and neither are the ENT_IDPARENT or ENT_IDCHILD fields. It looks like this isn't really the Cypher you used, but an example you built off of?
I'd change this relationship creation query to the following, storing the RID on the relationship for post-processing later (this assumes that there can only be one :RELATION_OF relation between the same two nodes):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///EntitiesRelationships_Updated.txt" AS Rships FIELDTERMINATOR '|'
// ENT_ID was stored with toInt(), so convert the CSV strings before matching
MATCH (parent:Entity {ENT_ID : toInt(Rships.ENT_IDPARENT)})
MATCH (child:Entity {ENT_ID : toInt(Rships.ENT_IDCHILD)})
MERGE (parent)-[r:RELATION_OF]->(child)
ON CREATE SET r.RID = Rships.RID;
Later on, if you like, you can match on your relationships with an RID, and add the corresponding type ("FATHER_OF", "MOTHER_OF", etc) property.
As for creating the :PARENT_OF relationship, you're doing some extra match on an Entity variable bound to every single node in your graph - get rid of that.
Instead, use this:
PROFILE
// first, match on all Entities with a PARENTID property
MATCH(child:Entity)
WHERE EXISTS(child.PARENTID)
// next, find the parent for each child by the child's PARENTID
WITH child
MATCH (parent:Entity {ENT_ID : child.PARENTID})
MERGE (parent)-[:PARENT_OF]->(child)
// lastly remove the parentid from the child, so it won't be reprocessed
// if we run the query again.
REMOVE child.PARENTID
EDITED the above query to use an existence check on child.PARENTID, and to remove child.PARENTID after the corresponding relationship has been created.
If you need a solution that uses batching, you could do this manually (adding LIMIT 100000 to your WITH child line), or you could install the APOC Procedures Library and use its periodic.commit() function to batch your processing.

Resources