I've got a graph consisting of nodes representing document versions connected together in paths of versions. These paths can be connected by another type of relationship that represents a change in the way documents were versioned. One of the problems with the graph is that the sources used to create it were not really clean, which is why I'm trying to write a query that adds a relationship so that the graph has a clean path of versions.
The part I'm stuck on is the following:
Say that I've got two paths of nodes from two different versioning periods. These paths are connected together by one or more relationships of the second type, which indicate a port of the document to the new system. I want a query that takes the last node satisfying some conditions in the old path and connects it to the first node satisfying some other conditions in the new path.
For example, in the following graph I would want to connect (D) to (2), because (1) does not satisfy my set of conditions:
(A)-[:Version]->(B)-[:Version]->(C)-[:Version]->(D)
                 |               |
              Ported          Ported
                 |               |
(1)-[:Version]->(2)-[:Version]->(3)
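Put another way, the only thing I want to create for this example is a single relationship from (D) to (2), using the :VersionGlobal type from my attempts below. Written by hand (the num values here are purely illustrative), that would be something like:
match (d:Document {num: 'D'}), (two:Document {num: '2'})
merge (two)<-[:VersionGlobal]-(d)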
I came up with different queries, but all of them fail in some cases:
This one fails because sometimes old documents were ported and split into multiple documents, meaning different paths, but my query selects only one 'new' node for each 'old' one, thus ignoring some paths.
//match all the 'port' and 'ported' relations between old and new versioning system
match (new:Document)-[r:Link]-(old:Document)
where new.num =~'[A-Z].{4}-.*' and old.num =~'[A-Z].{3}-.*' and r.type in ['PORT','PORTED']
//find youngest one satisfying a condition, here a date
optional match(new)<-[:Version*]-(newAncestor:ArticleCode)
where newAncestor.dateBegin >= '2012-01-01'
with old, collect(new) + collect(newAncestor) as potentialNewVersions
unwind potentialNewVersions as potentialNew
with distinct old, potentialNew
order by potentialNew.dateBegin, potentialNew.dateEnd
with distinct old, collect(potentialNew)[0] as youngestNew
//find oldest one satisfying a condition
optional match (old)-[:Version*]->(oldChild:ArticleCode)
where oldChild.dateEnd <= youngestNew.dateBegin
with old, youngestNew, collect(old) + collect(oldChild) as potentialOldVersions
unwind potentialOldVersions as potentialOld
with distinct old, youngestNew, potentialOld
order by potentialOld.dateEnd desc, potentialOld.dateBegin desc
with distinct youngestNew, collect(potentialOld)[0] as oldestOld
merge(youngestNew)<-[:VersionGlobal]-(oldestOld)
The second one is much simpler, but it selects too many nodes for the 'new' ones, as multiple versions can satisfy the date condition. In addition, it could fail if the only 'ported' relationship between the old and new paths was on a node before the limit date.
//this time I match all paths of new versions whose first node satisfies the condition
match p=(new:Document)-[:Version*0..]->(:Document)-[r:Link]-(old:ArticleCode)
where new.num =~'[A-Z].{4}-.*' and old.num =~'[A-Z].{3}-.*' and r.type in ['PORT','PORTED'] and new.dateBegin >= '2012-01-01'
//take first node of each path
with distinct nodes(p)[0] as youngestNew, old
//find latest old node
optional match p=(old)-[:Version*0..]->(oldChild:ArticleCode)
where oldChild.dateEnd <= youngestNew.dateBegin
with distinct last(nodes(p)) as oldestOld, old
merge(youngestNew)<-[:VersionGlobal]-(oldestOld)
Thanks
I think we found an answer using optional matches and cases:
match (new:Document)-[:Version*0..]-(:Document)-[r:Link]-(:Document)-[:Version*0..]-(old:Document)
where *myConditions*
optional match (newAncestor:Document)-[:Version]->(new)
with distinct
case
when newAncestor.dateBegin < '2012-01-01' or newAncestor is null
then new
end as youngestNew, old
where not(youngestNew is null)
optional match (old)-[:Version]->(oldChild:Document)
with distinct
youngestNew,
case
when oldChild.dateBegin > youngestNew.dateBegin or oldChild is null
then old
end as oldestOld
where not(oldestOld is null)
*merge part*
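To make that concrete, here is a filled-in sketch of the whole query: the num / r.type filters from my first attempt stand in for *myConditions*, and the merge part is the same :VersionGlobal merge I used above, so both should be adapted to the real conditions:
//sketch only: conditions and merge borrowed from the attempts above
match (new:Document)-[:Version*0..]-(:Document)-[r:Link]-(:Document)-[:Version*0..]-(old:Document)
where new.num =~ '[A-Z].{4}-.*' and old.num =~ '[A-Z].{3}-.*' and r.type in ['PORT','PORTED']
optional match (newAncestor:Document)-[:Version]->(new)
with distinct
case
when newAncestor.dateBegin < '2012-01-01' or newAncestor is null
then new
end as youngestNew, old
where not(youngestNew is null)
optional match (old)-[:Version]->(oldChild:Document)
with distinct
youngestNew,
case
when oldChild.dateBegin > youngestNew.dateBegin or oldChild is null
then old
end as oldestOld
where not(oldestOld is null)
merge (youngestNew)<-[:VersionGlobal]-(oldestOld)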
I'm lost and tried everything I can think of. Maybe you can help me.
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple: I've got packages, and a package can have multiple versions.
In my database I have 113,339,030 dependency relationships and 19,753,269 versions.
My whole code works fine until I hit a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts. Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
At first I tried PostgreSQL with a recursive common table expression.
with recursive cte as (
    select
        child_id
    from
        dependencies
    where
        dependencies.parent_id = 16674850
    union
    select
        dependencies.child_id
    from
        cte
        left join dependencies on
            cte.child_id = dependencies.parent_id
    where
        cte.child_id is not null
)
select * from cte;
That returns 1,726 elements, which seems to be OK. https://deps.dev/npm/react-scripts/4.0.3/dependencies returns 1,445 dependencies.
However, I'd like to get the paths to the nodes, and that doesn't work well with PostgreSQL and UNION. You'd have to use UNION ALL, but then the query becomes much more complicated and slower. That's why I thought I'd give Neo4j a chance.
My nodes have the properties
version_id: integer
name: string
version: string
I'm starting with what I thought would be a simple query, but it's already failing.
Start with the version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the variable-length depth to 12 or greater.
Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph".
I spent the last 6 weeks on this problem.
Thank you very much!
Edit 28/09/2021
I uploaded a sample data set. Here are the links
https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
  --database=deps \
  --skip-bad-relationships \
  --id-type=INTEGER \
  --nodes=Version=import/versions.csv \
  --relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
The trouble here is that Cypher is interested in finding all possible paths that match a pattern. That can make it problematic for cases when you just want distinct reachable nodes, where you really don't care about expanding to every distinct path, but just about finding nodes and ignoring any alternate paths leading to nodes already visited.
Also, the planner is making a bad choice with that cartesian product plan, which can make the problem worse.
I'd recommend using APOC Procedures for this, as there are procs that are optimized for expanding to distinct nodes and ignoring paths to nodes already visited. apoc.path.subgraphNodes() is the procedure to use.
Here's an example of use:
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
The arrow in the relationship filter indicates direction, and since it's pointing right it refers to traversing outgoing relationships. If we were interested in traversing incoming relationships instead, we would have the arrow at the start of the relationship name, pointing to the left.
For the label filter, the prefixed > means the label is an end label, meaning that we are only interested in returning the nodes of that given label.
You can remove the maxLevel config property if you want it to be an unbounded expansion.
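For example, a quick sanity check without the depth limit, just counting the distinct reachable versions, might look like this (a sketch reusing the same start node and filters as above):
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version'}) YIELD node AS b
RETURN count(b) AS reachableVersions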
More options and details here:
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-subgraph-nodes/
I don’t have a large dataset like yours, but I think you could bring the number of paths down by filtering the b nodes. Does this work, as a start?
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
UNWIND nodes(p) AS node
return COUNT(DISTINCT node)
To check if you can return longer paths, you could do
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN count(p)
Now if that works, I would do:
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN p LIMIT 10
to see whether the paths are correct.
Sometimes UNWIND causes an issue. To get the set of unique nodes, you could also try APOC:
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN apoc.coll.toSet(
    apoc.coll.flatten(
        COLLECT(nodes(p))
    )
) AS uniqueNodes
We create multiple graphs based on versions of a program (A and B in my example):
(:ProgNode {compileUnit: 'RL105A', nodeKey: 100, captureDate: '1/1/1'})
(:ProgNode {compileUnit: 'RL105B', nodeKey: 200, captureDate: '2/2/2'})
These fan out into full-blown graphs with thousands of nodes. We also have a single node (:ProgUnit {compileUnit: 'RL105'})
that is a "master" node for that program. We want to link the first node of each individual subgraph (the one with the lowest nodeKey) to the master. My current query looks like this:
MATCH (p:ProgNode) where p.compileUnit = 'RL105A' WITH min(p.nodeKey) as low_node
Match (j:ProgUnit) where j.compileUnit = 'RL105'
Create (j)-[r:RELEASE]->(p)
A and B will eventually be dates, but for now they are letters.
This works (sort of), but instead of linking the master to the subgraph, it seems to create a new node that isn't anything.
I know I will have to run this twice to build both links (A and B), and that's not an issue.
Thoughts? What am I doing wrong here?
Your WITH clause did not include p as a term, so p became an unbound variable again.
The following query should create a RELEASE relationship to the ProgNode whose compileUnit starts with "RL105" and whose nodeKey has the lowest value:
MATCH (p:ProgNode) WHERE p.compileUnit STARTS WITH 'RL105'
WITH p ORDER BY p.nodeKey LIMIT 1
MATCH (j:ProgUnit) WHERE j.compileUnit = 'RL105'
CREATE (j)-[:RELEASE]->(p)
Use MERGE instead of CREATE if you need to avoid creating duplicate relationships.
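If running it once per subgraph ever becomes a nuisance, a variation that picks the lowest nodeKey per compile unit in a single pass might look roughly like this (just a sketch, assuming every per-version compileUnit starts with the master's 'RL105', as in your RL105A/RL105B example):
// group ProgNodes by compileUnit, keep the one with the lowest nodeKey in each group,
// and link each of them to the single RL105 master
MATCH (p:ProgNode) WHERE p.compileUnit STARTS WITH 'RL105'
WITH p ORDER BY p.nodeKey
WITH p.compileUnit AS unit, collect(p)[0] AS first
MATCH (j:ProgUnit) WHERE j.compileUnit = 'RL105'
MERGE (j)-[:RELEASE]->(first)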
I have a graph where some nodes were created out of an error in the app.
I want to delete those nodes (they represent a log), but I can't figure out how to loop through the nodes.
I don't know how to access nodes in a collection of paths, and I need to do that in order to compare one node to another.
match (o:Order{id:123})
match (o)-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
with collect((l:Log)-[:STATUS]->(os:OrderStatus)) as logs
I want to access each one of the nodes in the paths to perform a comparison. There are normally 5 or 6 of (l)-[:STATUS]->(os) for each Order.
How can I access the (l) and (os) nodes of each path, to perform the comparisons between their properties?
For example, if I had this collection of paths in one of the Orders:
(log1)-[:STATUS]->(os1)
(log2)-[:STATUS]->(os2)
(log3)-[:STATUS]->(os3)
(log4)-[:STATUS]->(os2) <-- This is the error
(log5)-[:STATUS]->(os4)
So, from the collection of paths above, I'd want to detach delete (log4), because its (os2) node is lower than the previous one (os3), when it should be greater.
And after that, I want to attach (log3) to (log5).
NOTE: Each one of the (os) nodes has an id that represents the "status", and these ids go from 1 to 5. Also, the (log) nodes are ordered by their created datetime.
Any idea on how to do this? Thank you in advance guys!
EDIT
I didn't mention some other scenarios I had. This is one of them:
Based on cybersam's answer, I figured out how to make it work.
I had to run 2 separate queries, but the principle is the same, and it goes as follows:
Create new relationships:
MATCH(o:Order)-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE SIZE((o)-[:STATUS_CHANGE*]->()-[:STATUS]->(os)) >= 1
WITH o, os, COLLECT(l)[0] AS keep
WITH o, collect(keep) AS k
FOREACH(i IN range(0, size(k)-1) |
  FOREACH(a IN [k[i]] |
    FOREACH(b IN [k[i+1]] |
      FOREACH(c IN CASE WHEN b IS NOT NULL THEN [1] END |
        MERGE (a)-[:STATUS_CHANGE]->(b) ))));
Delete exceeded nodes:
MATCH(o:Order)-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE (os)<-[:STATUS]-()-[:STATUS_CHANGE*]->(l)-[:STATUS]->(os)
WITH o, os, COLLECT(l) AS exceed
UNWIND exceed AS del
detach delete del;
These queries worked in every scenario.
Assuming all your errors follow the same pattern (the unwanted Log nodes are always referencing an "older" OrderStatus), this may work for you:
MATCH (o:Order{id:123})-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE SIZE(()-[:STATUS]->(os)) > 1
WITH os, COLLECT(l) AS logs
UNWIND logs[1..] AS unwanted
OPTIONAL MATCH (x)-[:STATUS_CHANGE]->(unwanted)-[:STATUS_CHANGE]->(y)
DETACH DELETE unwanted
FOREACH(ignored IN CASE WHEN x IS NOT NULL THEN [1] END | CREATE (x)-[:STATUS_CHANGE]->(y))
This query:
Finds (in order) all relevant OrderStatus nodes having multiple STATUS relationships.
Uses the aggregating function COLLECT to collect (in order) the Log nodes related to each of those OrderStatus nodes.
Uses UNWIND logs[1..] to get the individual unwanted Log nodes.
Uses OPTIONAL MATCH to get the 2 nodes that may need to be connected together, after the unwanted node is deleted.
Uses DETACH DELETE to delete each unwanted node and its relationships.
Uses FOREACH to connect together the pair of nodes that might have been found by the OPTIONAL MATCH.
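As a usage note, if you'd like to preview exactly which Log nodes would be removed before running the destructive version, you could (as a sketch) keep the same matching logic and just return them instead of deleting:
MATCH (o:Order{id:123})-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE SIZE(()-[:STATUS]->(os)) > 1
WITH os, COLLECT(l) AS logs
UNWIND logs[1..] AS unwanted
RETURN unwanted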
We are in a POC for Neo4j. The use case is a dashboard where we only bring back opportunities that a seller is qualified for and has not already taken an action on. Currently there are 3 criteria and we are looking to add two more. The corresponding SQL is 3 pages, so we are looking for a better way, since adding the next criteria (2 more node paths in Neo4j) will be a bear in SQL. When I run the query below I get back a different number of rows than the SQL. The buys returned must be at the end of all 3 paths and not be in the 4th. I hope you can point out where I went wrong. If this is a good query, then I have a data problem.
Here is the query:
//oportunities dashboard
MATCH (s:SellerRep)-[:SELLS]->(subCat:ProductSubCategory)<-[:IS_FOR_SUBCAT]-(b:Buy)
MATCH (s:SellerRep)-[:SELLS_FOR]->(o:SellerOrg)-[:HAS_SELLER_TYPE]->(st:SellerType)<-[:IS_FOR_ST]-(b:Buy)
MATCH (s:SellerRep)-[:SELLS_FOR]->(o:SellerOrg)-[:IS_IN_SC]->(sc:SellerCommunity)<-[:IS_FOR_SC]-(b:Buy)
WHERE NOT (s:SellerRep)-[:PLACED_BID]->(:Bid)-[:IS_FOR_BUY]->(b:Buy)
AND s.sellerRepId = 217722 and b.currBuyStatus = 'Open'
RETURN b.buyNumber, b.buyDesc, st.sellerType, sc.communtiyName, subCat.subCategoryName+' - '+subCat.desc as sub_cat
If it helps, here is the data model:
POC Data model
Thanks for any help.
A WHERE clause only filters the immediately preceding MATCH clause.
Since you placed your WHERE clause after the third MATCH clause, the first 2 MATCH clauses are not bound to a specific SellerRep or Buy node, and are therefore bringing in more ProductSubCategory and SellerType nodes than you intended.
The following query is probably closer to what you intended:
MATCH (s:SellerRep)-[:SELLS]->(subCat:ProductSubCategory)<-[:IS_FOR_SUBCAT]-(b:Buy)
WHERE s.sellerRepId = 217722 AND b.currBuyStatus = 'Open' AND NOT (s:SellerRep)-[:PLACED_BID]->(:Bid)-[:IS_FOR_BUY]->(b:Buy)
MATCH (s)-[:SELLS_FOR]->(o:SellerOrg)-[:HAS_SELLER_TYPE]->(st:SellerType)<-[:IS_FOR_ST]-(b)
MATCH (o)-[:IS_IN_SC]->(sc:SellerCommunity)<-[:IS_FOR_SC]-(b)
RETURN b.buyNumber, b.buyDesc, st.sellerType, sc.communtiyName, subCat.subCategoryName+' - '+subCat.desc as sub_cat
NOTE: Your second and third MATCH clauses both started with (s:SellerRep)-[:SELLS_FOR]->(o:SellerOrg). I simplified the same logic by having my third MATCH clause just start with (o). Hopefully, you actually intended to force both clauses to refer to the same SellerOrg node.
I have a Neo4J DB up and running with currently 2 Labels: Company and Person.
Each Company Node has a Property called old_id.
Each Person Node has a Property called company.
Now I want to establish a relationship between each Company and each Person where old_id and company share the same value.
Already tried suggestions from: Find Nodes with the same properties in Neo4J and
Find Nodes with the same properties in Neo4J
Following the first link, I tried:
MATCH (p:Person)
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
resulting in no change at all, and as suggested by the second link I tried:
START
p=node(*), c=node(*)
WHERE
HAS(p.company) AND HAS(c.old_id) AND p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN p, c;
resulting in a runtime of more than 36 hours. I had to abort the command without knowing whether it would eventually have worked. Therefore I'd like to ask if it's theoretically correct and I'm just impatient (the dataset is quite big, tbh), or if there's a more efficient way of doing it.
This simple console shows that your original query works as expected, assuming:
Your stated data model is correct
Your data actually has Person and Company nodes with matching company and old_id values, respectively.
Note that, in order to match, the values must be of the same type (e.g., both are strings, or both are integers, etc.).
So, check that #1 and #2 are true.
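As a quick way to test both points, you could sample a few Person nodes and see whether they find a matching Company (just a sketch using the property names above; if it returns no rows, the values probably differ in type or content):
MATCH (p:Person) WITH p LIMIT 5
MATCH (c:Company) WHERE c.old_id = p.company
RETURN p.company AS personCompany, c.old_id AS companyOldId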
Depending on the size of your dataset, you may want to page it:
create constraint on (c:Company) assert c.old_id is unique;
MATCH (p:Person)
WITH p SKIP 100000 LIMIT 100000
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN count(*);
Just increase the skip value from zero to your total number of people in 100k steps.
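Alternatively, if APOC is available, the batching can be driven by a single call to apoc.periodic.iterate instead of adjusting SKIP by hand. A sketch, using the same labels and property names as above:
CALL apoc.periodic.iterate(
  "MATCH (p:Person) RETURN p",
  "MATCH (c:Company {old_id: p.company}) CREATE (p)-[:BELONGS_TO]->(c)",
  {batchSize: 100000}
)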