We create multiple graphs based on versions of a program (A,B in my example)
(:ProgNode {compileUnit:RL105A, nodeKey:100, captureDate:1/1/1} )
(:ProgNode {compileUnit:RL105B, nodeKey:200, captureDate:2/2/2} )
These fan out into full-blown graphs with thousands of nodes. We also have a single node (:ProgUnit{compileUnit:RL105})
that is a "master" node for that program. We want to link the first node of each individual subgraph (the one with the lowest nodeKey) to the master. My current query looks like this:
MATCH (p:ProgNode) where p.compileUnit = 'RL105A' WITH min(p.nodeKey) as low_node
Match (j:ProgUnit) where j.compileUnit = 'RL105'
Create (j)-[r:RELEASE]->(p)
A and B will eventually be dates, but for now they are letters.
This works (sort of), but instead of linking the master to the subgraph, it seems to create a new, empty node.
I know I will have to run this twice to build both links (A, B), and that's not an issue.
Thoughts? What am I doing wrong here?
Your WITH clause did not include p as a term, so p became an unbound variable again. CREATE then treated p as a new variable and created an empty node for it.
The following query should create a RELEASE relationship to the ProgNode whose compileUnit starts with "RL105" and whose nodeKey has the lowest value:
MATCH (p:ProgNode) WHERE p.compileUnit STARTS WITH 'RL105'
WITH p ORDER BY p.nodeKey LIMIT 1
MATCH (j:ProgUnit) WHERE j.compileUnit = 'RL105'
CREATE (j)-[:RELEASE]->(p)
Use MERGE instead of CREATE if you need to avoid creating duplicate relationships.
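To illustrate the difference between aggregating a value and keeping the row that holds it, here is a small Python sketch (not Cypher). The sample records and the helper name `lowest_per_unit` are made up for illustration. It picks the node with the lowest nodeKey for each compileUnit, which is the node each RELEASE relationship would need to point to.

```python
# Hypothetical sample data: a few ProgNode records from two subgraphs.
nodes = [
    {"compileUnit": "RL105A", "nodeKey": 100},
    {"compileUnit": "RL105A", "nodeKey": 101},
    {"compileUnit": "RL105B", "nodeKey": 200},
    {"compileUnit": "RL105B", "nodeKey": 205},
]

def lowest_per_unit(records):
    """Return {compileUnit: record with the lowest nodeKey}, keeping the whole record."""
    best = {}
    for rec in records:
        unit = rec["compileUnit"]
        if unit not in best or rec["nodeKey"] < best[unit]["nodeKey"]:
            best[unit] = rec
    return best

# Aggregating only the value loses the node it came from ...
low_key = min(rec["nodeKey"] for rec in nodes if rec["compileUnit"] == "RL105A")
# ... whereas keeping the record gives something a relationship can attach to.
first_nodes = lowest_per_unit(nodes)
```

Here `low_key` is just a number, much like low_node in the original query, while `first_nodes` still holds the nodes themselves.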
I'm lost and tried everything I can think of. Maybe you can help me.
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions.
In my database I have 113.339.030 dependency relationships and 19.753.269 versions.
My whole code works fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts. Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
At first I tried PostgreSQL with recursive common table expression.
with recursive cte as (
select
child_id
from
dependencies
where
dependencies.parent_id = 16674850
union
select
dependencies.child_id
from
cte
left join dependencies on
cte.child_id = dependencies.parent_id
where
cte.child_id is not null
)
select * from cte;
That returns 1.726 elements which seems to be OK. https://deps.dev/npm/react-scripts/4.0.3/dependencies returns 1.445 dependencies.
However I'd like to get the path to the nodes and that doesn't work well with PostgreSQL and UNION. You'd have to use UNION ALL but the query will be much more complicated and slower. That's why I thought I'd give Neo4j a chance.
My nodes have the properties
version_id: integer
name: string
version: string
I'm starting with what I thought would be a simple query but it's already failing.
Start with version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the maximum depth of the variable-length pattern to 12 or greater.
Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph".
I spent the last 6 weeks on this problem.
Thank you very much!
Edit 28/09/2021
I uploaded a sample data set. Here are the links
https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
The trouble here is that Cypher is interested in finding all possible paths that match a pattern. That can make it problematic for cases when you just want distinct reachable nodes, where you really don't care about expanding to every distinct path, but just want to find nodes and ignore any alternate paths leading to nodes already visited.
Also, the planner is making a bad choice with that cartesian product plan, which can make the problem worse.
I'd recommend using APOC Procedures for this, as there are procs that are optimized for expanding to distinct nodes and ignoring paths to those already visited. apoc.path.subgraphNodes() is the procedure.
Here's an example of use:
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
The arrow in the relationship filter indicates direction, and since it's pointing right it refers to traversing outgoing relationships. If we were interested in traversing incoming relationships instead, we would have the arrow at the start of the relationship name, pointing to the left.
For the label filter, the prefixed > means the label is an end label, meaning that we are only interested in returning the nodes of that given label.
You can remove the maxLevel config property if you want it to be an unbounded expansion.
More options and details here:
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-subgraph-nodes/
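To make the difference between "all paths" and "distinct reachable nodes" concrete, here is a minimal Python sketch. It is not how APOC is implemented, only the underlying idea. The function names and the layered test graph are invented for illustration, and unlike the procedure call above the sketch leaves the start node out of its result.

```python
from collections import deque

def reachable_nodes(graph, start, max_level=None):
    """Breadth-first expansion that visits each node once (start excluded)."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, level = queue.popleft()
        if max_level is not None and level >= max_level:
            continue  # do not expand beyond the requested depth
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append((child, level + 1))
    return found

def count_paths(graph, start):
    """Count every distinct path leaving start (what path matching enumerates)."""
    return sum(1 + count_paths(graph, child) for child in graph.get(start, []))

def layered_graph(layers, width):
    """Hypothetical test graph: each node depends on every node in the next layer."""
    graph = {"root": [(1, i) for i in range(width)]}
    for layer in range(1, layers):
        for i in range(width):
            graph[(layer, i)] = [(layer + 1, j) for j in range(width)]
    return graph

g = layered_graph(layers=6, width=3)
node_count = len(reachable_nodes(g, "root"))
path_count = count_paths(g, "root")
```

On this toy graph there are only 18 reachable nodes but 1092 distinct paths, and the gap widens with every extra layer. Shared transitive dependencies have the same effect on a variable-length pattern match.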
I don’t have a large dataset like yours, but I think you could bring the number of paths down by filtering the b nodes. Does this work as a start?
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
UNWIND nodes(p) AS node
return COUNT(DISTINCT node)
To check if you can return longer paths, you could do
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN count(p)
Now if that works, I would do :
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN p LIMIT 10
to see whether the paths are correct.
Sometimes UNWIND causes an issue. To get the set of unique nodes, you could also try APOC:
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN apoc.coll.toSet(
apoc.coll.flatten(
COLLECT(nodes(p))
)
) AS uniqueNodes
I intended to clone a single node and its 3 connections, but ended up with multiple clones.
By first MATCHing the entire graph of primary node and related nodes, when I call apoc.refactor.cloneNodes, it seems to iterate over each related node instead of just the primary node I want to clone. Result is the original primary node and 3 clones (instead of the intended 1 clone) connected to the expected related nodes.
. . .
I created this toy graph:
create (a:Node {description:"Spider Man Series"})
create (b:Node {description:"Spidey"})
create (c:Node {description:"Doc Oc"})
create (d:Node {description:"Venom"})
create (a)-[:BELONGS]->(b)
create (a)-[:BELONGS]->(c)
create (a)-[:BELONGS]->(d)
return a,b,c,d
I want to clone "Spider Man Series" (and its relationships):
match (a)-[c]-(b)
where a.description="Spider Man Series"
call apoc.refactor.cloneNodes([a],true) yield output
return a,b,c, output
But this creates 3 clones (one for each related character node). I'm guessing it has something to do with the MATCH including a relationship.
Because if I just limit my MATCH with no relationships, I get the proper clone behavior (the original "Spider Man Series" and the clone "Spider Man Series" with cloned relationships). I'm confused because there's only 1 node that results from the WHERE clause which is stored in (a).
match (a)
where a.description="Spider Man Series"
call apoc.refactor.cloneNodes([a],true) yield output
return a,output
. . .
I tried limiting the related nodes to 2 instead of everything "Spider Man Series" was connected to, but this ALSO gave me a clone for each related node:
match (a)-[c]-(b)
where a.description="Spider Man Series" and b.description in ['Spidey','Venom']
call apoc.refactor.cloneNodes([a],true) yield output
return a,b,c, output
apoc.refactor.cloneNodes will take the nodes you give it and create copies of them, copying the relationships from the old nodes to the new nodes if you give it true as that second parameter.
You're seeing duplication because, as you say, there are multiple rows coming back from that first query, and the procedure is called once per row. One approach is to apply DISTINCT to the a nodes before you do the clone:
match (a)-[c]-(b)
where a.description="Spider Man Series"
WITH distinct a as da
call apoc.refactor.cloneNodes([da],true) yield output
return output
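The row-by-row behaviour can be pictured with a short Python sketch. This is only a model of how rows flow into a per-row call; `clone_calls` is a made-up stand-in and does not reproduce what apoc.refactor.cloneNodes does internally.

```python
# Hypothetical rows produced by MATCH (a)-[c]-(b): one row per relationship.
rows = [
    ("Spider Man Series", "BELONGS", "Spidey"),
    ("Spider Man Series", "BELONGS", "Doc Oc"),
    ("Spider Man Series", "BELONGS", "Venom"),
]

def clone_calls(values):
    """A per-row procedure call runs once for each incoming value."""
    return [f"clone of {value}" for value in values]

# Calling per row: the same 'a' arrives three times, so three clones are made.
per_row = clone_calls(a for a, _, _ in rows)

# Deduplicating first (the WITH DISTINCT step) leaves a single call.
distinct_a = list(dict.fromkeys(a for a, _, _ in rows))
per_distinct = clone_calls(distinct_a)
```

Filtering b down to two characters still leaves two rows, which matches the two clones seen in that attempt.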
However, if you want to create a complete copy of the subgraph (i.e. two 'Spider Man Series' nodes, each with its own three character nodes, and no connections between the two subgraphs), then something like apoc.refactor.cloneSubgraphFromPaths will work better:
match path=(a)-[c]-(b)
where a.description="Spider Man Series"
with collect(path) as paths
call apoc.refactor.cloneSubgraphFromPaths(paths) YIELD output
return output
I have a graph where some nodes were created out of an error in the app.
I want to delete those nodes (they represent a log), but I can't figure out how to loop thru the nodes.
I don't know how to access nodes in a collection of paths, and I need to do that in order to compare one node to another.
match (o:Order{id:123})
match (o)-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
with collect((l:Log)-[:STATUS]->(os:OrderStatus)) as logs
I want to access each one of the nodes in the paths to perform a comparison. There are normally 5 or 6 (l)-[:STATUS]->(os) paths for each Order.
How can I access the (l) and (os) nodes of each path, to compare their properties?
For example, if I had this collection of paths in one of the Orders:
(log1)-[:STATUS]->(os1)
(log2)-[:STATUS]->(os2)
(log3)-[:STATUS]->(os3)
(log4)-[:STATUS]->(os2) <-- This is the error
(log5)-[:STATUS]->(os4)
So, from the collection of paths above, I'd want to detach delete the (log4), because the (os2) node is lower than the previous one (os3), and should be greater.
And after that, I want to attach the (log3) to the (log5)
NOTE: Each one of the (os) nodes has an id that represents the "status", and go from 1 to 5. Also, the (log) nodes are ordered by the created datetime.
Any idea on how to do this? Thank you in advance guys!
EDIT
I didn't mention some other scenarios I had. This is one of them:
Based on #cybersam answer, I found out how to work it out.
I had to run 2 separated queries to make it work, but the principle is the same, and is as follows:
Create new relationships:
MATCH(o:Order)-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE SIZE((o)-[:STATUS_CHANGE*]->()-[:STATUS]->(os)) >= 1
WITH o, os, COLLECT(l)[0] AS keep
WITH o, collect(keep) AS k
FOREACH(i IN range(0,size(k)-1) |
FOREACH(a IN [k[i]] |
FOREACH(b IN [k[i+1]] |
FOREACH(c IN CASE WHEN b IS NOT NULL THEN [1] END | MERGE (a)-[:STATUS_CHANGE]->(b) ))));
Delete excess nodes:
MATCH(o:Order)-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE (os)<-[:STATUS]-()-[:STATUS_CHANGE*]->(l)-[:STATUS]->(os)
WITH o, os, COLLECT(l) AS exceed
UNWIND exceed AS del
detach delete del;
These queries worked in every scenario.
Assuming all your errors follow the same pattern (the unwanted Log nodes are always referencing an "older" OrderStatus), this may work for you:
MATCH (o:Order{id:123})-[:STATUS_CHANGE*]->(l:Log)-[:STATUS]->(os:OrderStatus)
WHERE SIZE(()-[:STATUS]->(os)) > 1
WITH os, COLLECT(l) AS logs
UNWIND logs[1..] AS unwanted
OPTIONAL MATCH (x)-[:STATUS_CHANGE]->(unwanted)-[:STATUS_CHANGE]->(y)
DETACH DELETE unwanted
FOREACH(ignored IN CASE WHEN x IS NOT NULL THEN [1] END | CREATE (x)-[:STATUS_CHANGE]->(y))
This query:
Finds (in order) all relevant OrderStatus nodes having multiple STATUS relationships.
Uses the aggregating function COLLECT to collect (in order) the Log nodes related to each of those OrderStatus nodes.
Uses UNWIND logs[1..] to get the individual unwanted Log nodes.
Uses OPTIONAL MATCH to get the 2 nodes that may need to be connected together, after the unwanted node is deleted.
Uses DETACH DELETE to delete each unwanted node and its relationships.
Uses FOREACH to connect together the pair of nodes that might have been found by the OPTIONAL MATCH.
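The steps above can be sketched in plain Python to check the logic on the example chain from the question. It rests on the same assumption as the query (an unwanted Log always points at a status that an earlier Log already used). The data and the name `repair_chain` are illustrative only.

```python
# Hypothetical chain for one order: (log name, status id), in creation order.
chain = [("log1", 1), ("log2", 2), ("log3", 3), ("log4", 2), ("log5", 4)]

def repair_chain(logs):
    """Keep the first log seen for each status; drop later repeats."""
    seen_status = set()
    kept, removed = [], []
    for name, status in logs:
        if status in seen_status:
            removed.append(name)   # corresponds to DETACH DELETE
        else:
            seen_status.add(status)
            kept.append(name)
    # Consecutive kept logs are the pairs that need a STATUS_CHANGE link.
    links = list(zip(kept, kept[1:]))
    return kept, removed, links

kept, removed, links = repair_chain(chain)
```

For the example, log4 is removed and log3 ends up linked to log5, as the question asks.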
Let's say I have nodes that are connected by FRIEND relationships.
I want to query 2 of them each time, so I use SKIP and LIMIT to do this.
However, if someone adds a FRIEND in between calls, this messes up my results (since suddenly the whole list is pushed 1 index forward).
For example, let's say I had this list of friends (ordered by some parameter):
A B C D
I query the first time, so I get A B (skipped 0 and limited 2).
Then someone adds a friend named E; the list is now E A B C D.
Now the second query will return B C (skipped 2 and limited 2). Notice that B is returned twice, because the skipping method is not aware of the changes made to the DB.
Is there a way to return 2 each time, starting from where the previous query ended? For example, if I knew that B was the last one returned from the query, I could provide it to the query and it would return the NEXT 2, getting C D (which is correct) instead of B C.
I tried finding a solution and I read about START and indexes, but I am not sure how to do this.
Thanks for your time!
You could store a timestamp when the FRIEND relationship was created and order by that property.
When the FRIEND relationship is created, add a timestamp property:
MATCH (a:Person {name: "Bob"}), (b:Person {name: "Mike"})
CREATE (a)-[r:FRIEND]->(b)
SET r.created = timestamp()
Then when you are paginating through friends two at a time you can order by the created property:
MATCH (a:Person {name: "Bob"})-[r:FRIEND]->(friends)
RETURN friends
ORDER BY r.created
SKIP {page_number_times_page_size} LIMIT {page_size}
You can parameterize this query with the page size (the number of friends to return) and the number of friends to skip based on which page you want.
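The question also asks about continuing from the last item returned rather than from an offset. A minimal Python sketch of that idea, assuming each FRIEND carries the created timestamp described above, follows. This is a variation on the answer and not the query it gives, and `next_page` and the sample data are hypothetical.

```python
# Hypothetical friend list: (name, created timestamp), as stored on FRIEND.
friends = [("A", 10), ("B", 20), ("C", 30), ("D", 40)]

def next_page(rows, after_created=None, page_size=2):
    """Return the next page of rows created after the given cursor value."""
    ordered = sorted(rows, key=lambda row: row[1])
    if after_created is not None:
        ordered = [row for row in ordered if row[1] > after_created]
    return ordered[:page_size]

page1 = next_page(friends)                        # first call, no cursor
cursor = page1[-1][1]                             # remember B's timestamp
friends.append(("E", 50))                         # someone adds a friend in between
page2 = next_page(friends, after_created=cursor)  # continues after B
```

Because the second call filters on the remembered timestamp instead of skipping a fixed count, the newly added friend cannot shift the page. If two relationships can share a timestamp, a tie-breaking value would be needed as well.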
Sorry if this is not exactly an answer to your question. On a previous project I had to modify a large amount of data. It wasn't possible to modify everything with one query, so I needed to split the work into batches. I started with SKIP and LIMIT, but in some cases it behaved unpredictably (not all of the data was modified). When I got tired of looking for the reason, I changed my approach. I was querying the database from Java, so I fetched all the ids I needed to modify in a first query and then iterated over the stored ids.
First, sorry for my English. I am modeling a railway database in Neo4j. I want to link stations in the order in which the railway connects them, using the stops table. Every stop has a "stop sequence" that unfortunately is not always consecutive like 1,2,3 but only increasing, like 1,3,4,6. I wrote this query, which for that reason doesn't always work.
MATCH (a:Station)-[:stop]->(c:Stops_times)-[:trip]->(z:Trips)<-[:trip]-(d:Stops_times)<-[:stop]-(b:Station) WHERE toint(c.stop_sequence)=toint(d.stop_sequence)+1 CREATE (a)-[s:next]->(b)
To find the right "next" I need a query similar to this:
MATCH (a:Station)-[:stop]->(c:Stops_times)-[:trip]->(z:Trips)<-[:trip]-(d:Stops_times)<-[:stop]-(b:Station) WITH c as c, d as d, MIN(d.stop_sequence) as min_ WHERE min_>c.stop_sequence CREATE UNIQUE (a)-[s:next]->(b)
Therefore, for every stop, I have to find the minimum "stop_sequence" among those that are higher than the "stop_sequence" of the stop whose next stop I want to find.
The following query seems to do what you want. It orders all the stops by stop_sequence, aggregates all the stops (still in order) for each trip, pairs up all adjoining stops for each trip, UNWINDs the pairs so that MERGE can use the paired nodes, and then uses MERGE to ensure that the :next relationship exists between all node pairs.
MATCH (a:Station)-[:stop]->(c:Stops_times)-[:trip]->(t:Trips)
WITH a, c, t
ORDER BY c.stop_sequence
WITH t, COLLECT(a) AS s
WITH REDUCE(x =[], i IN RANGE(1, SIZE(s)-1)| x + {a: s[i-1], b: s[i]}) AS pairs
UNWIND pairs AS p
WITH p.a AS a, p.b AS b
MERGE (a)-[n:next]->(b);
It works properly in 2.3.2 on my Mac (but the neo4j versions available at http://console.neo4j.org/ do not work correctly when the query gets to MERGE).
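The pairing step is the core of that query, so here is a small Python sketch of it for a single trip. The station names and the function `next_pairs` are invented for illustration. The sketch uses integer sequence numbers; values stored as strings would need converting before sorting.

```python
# Hypothetical stops for one trip: (station, stop_sequence); gaps are allowed.
stops = [("Roma", 4), ("Milano", 1), ("Bologna", 3), ("Napoli", 6)]

def next_pairs(trip_stops):
    """Sort stops by sequence and pair each station with the one after it."""
    ordered = [station for station, _ in sorted(trip_stops, key=lambda s: s[1])]
    return list(zip(ordered, ordered[1:]))

pairs = next_pairs(stops)
```

Each pair corresponds to one :next relationship. Because only the order matters, gaps such as 1,3,4,6 in the sequence numbers make no difference.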