Neo4J Cypher Query on 5M nodes

Neo4J Cypher Query on 5M nodes - neo4j

This is the basic query I am trying:
MATCH (b1:Business),(b2:Business) WHERE ID(b1)<>ID(b2) AND b1.name[0]=b2.name[0]
WITH b1,b2,apoc.create.uuid() as uuid
MERGE (b1)-[d:MCC_NAME]->(b2)
ON CREATE
SET d.m_score = 100
SET d.m_event = uuid
SET d.m_dt = datetime()
RETURN count (d)
I have also tried to separate the query and run through apoc.periodic.iterate() but in either case the query runs forever and never yields results. The name property is an array but at present there are only single entries in it, so I tried to simplify by using simple comparison of name[0], but it didn't help. The database is fairly large, about 5 million nodes. Any advice appreciated.

I would do this:
MATCH (b1:Business)
WITH b1
MATCH(b2:Business) WHERE b1.name[0]=b2.name[0] AND b1<>b2
WITH b1,b2,apoc.create.uuid() as uuid
MERGE (b1)-[d:MCC_NAME]->(b2)
SET …
Make sure you have an index set on name.
When you do not want to have the edges to be bi-directional, you could do
WHERE id(b1)>id(b2)

Related

Neo4j cypher query for connected components is too slow

I am looking for help to optimize my following cypher query.
CALL algo.unionFind.stream()
YIELD nodeId,setId
MATCH(n) where ID(n) = nodeId AND NOT (n)-[:IS_CHILD_OF]-()
call apoc.create.uuids(1) YIELD uuid
WITH n as nod, uuid, setId
WHERE nod is not null
MERGE(groupid:GroupId {group_id:'id_'+toString(setId)})
ON CREATE set groupid.group_value = uuid, groupid.updated_at = '1512135348335'
MERGE(nod)-[:IS_CHILD_OF]->(groupid)
RETURN count(nod);
I have already applied the unique constraints and index over group_id. Even I am using a good configurations machine i3-2xl.
The above query is taking too long time ~22 minutes for ~500k nodes.
Following are the things I want to achieve from the above query.
Get all the connected components(sub-graph).
Create a new node for each group(connected components).
Assign uuid as a value of the new group node.
Build the relationship with all the group members with the new group node.
Any suggestions are welcome to optimize the above query, or please let me know if is there any other way to achieve my listed requirement.

querying a DB for an unknown element with a given uuid

Lot of years ago, I discussed with some neo4j engineers about the ability to query an unknown object given it's uuid.
At that time, the answer was that there was no general db index in neo4j.
Now, I have the same problem to solve:
each node I create has an unique id (uuid in the form <nx:u-<uuid>-?v=n> where ns is the namespace, uuid is a unique uuid and v=n is the version number of the element.
I'd like to have the ability to run the following cypher query:
match (n) where n.about = 'ki:u-SSD-v5.0?v=2' return n;
which actually return nothing.
The following query
match (n:'mm:ontology') where n.about = 'ki:u-SSD-v5.0?v=2' return n;
returns what I need, despite the fact that at query time I don't know the element type.
Can anyone help on this?
Paolo

Have you considered adding a achema index to every node in the database for the about attribute?
For instance
Add a global label to all nodes in the graph (e.g. Node) that do not already have it. If your graph is overly large and/or heap overly small you will need to batch this operation. Something along the lines of the following...
MATCH (n)
WHERE NOT n:Node
WITH n
LIMIT 100000
SET n:Node
After the label is added then create an index on the about attribute for your new global label (e.g. Node). These steps can be performed interchangeably as well.
CREATE CONSTRAINT ON (node:Node) assert node.about IS UNIQUE
Then querying with something like the following
MATCH (n:Node)
WHERE n.about = 'ki:u-SSD-v5.0?v=2'
RETURN n;
will return the node you are seeking in a performant manner.

Neo4j relate nodes by same property

I have a Neo4J DB up and running with currently 2 Labels: Company and Person.
Each Company Node has a Property called old_id.
Each Person Node has a Property called company.
Now I want to establish a relation between each Company and each Person where old_id and company share the same value.
Already tried suggestions from: Find Nodes with the same properties in Neo4J and
Find Nodes with the same properties in Neo4J
following the first link I tried:
MATCH (p:Person)
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
resulting in no change at all and as suggested by the second link I tried:
START
p=node(*), c=node(*)
WHERE
HAS(p.company) AND HAS(c.old_id) AND p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN p, c;
resulting in a runtime >36 hours. Now I had to abort the command without knowing if it would eventually have worked. Therefor I'd like to ask if its theoretically correct and I'm just impatient (the dataset is quite big tbh). Or if theres a more efficient way in doing it.

This simple console shows that your original query works as expected, assuming:
Your stated data model is correct
Your data actually has Person and Company nodes with matching company and old_id values, respectively.
Note that, in order to match, the values must be of the same type (e.g., both are strings, or both are integers, etc.).
So, check that #1 and #2 are true.

Depending on the size of your dataset you want to page it
create constraint on (c:Company) assert c.old_id is unique;
MATCH (p:Person)
WITH p SKIP 100000 LIMIT 100000
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN count(*);
Just increase the skip value from zero to your total number of people in 100k steps.

How to optimize the calculation of the shortest time in a certain path in Neo4j?

I have a database in Neo4j of modules that I imported through CSV. The data looks something like this. Each module has its name, it's module that is the successor, average time duration and another duration called medtime.
I have been able to import the data and to set the relationships through a Cypher Query script that looks like this:
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
CREATE (n:Module)
SET n = row, n.name = row.name, n.mafter = row.mafter, n.avgtime = row.avgtime, n.medtime = row.medtime
WITH n
RETURN n
Then I have set the relationships like this:
Match (p:Module),(q:Module)
Where p.mafter = q.name
Merge (p)-[:PRECEEDS]->(q)
Return p,q
Now to the point. I want to calculate the shortest path from a certain module to another, more specifically the time that it takes to get from a module to another and for this, I use the more or less copied part of the script from
http://www.neo4j.org/graphgist?8412907 and that is
MATCH p = (trop:Module {name:'BLSACXAMT0A_00'})-[prec:PRECEEDS*]->(hop:Module {name:'BL_LOAD_CLOSE'})
WITH p, REDUCE(x = 0, a IN NODES(p) | x + a.avgtime) AS cum_duration
ORDER BY cum_duration DESC
LIMIT 1
RETURN cum_duration AS `Total Average Time`
This, however, takes about 50 second to execute and that is outrageous. You can see it on the screenshot right below. The ammount of modules imported into the database is only about 2000 and what I want to achieve, is to successfully work with more than 50 000 nodes and perform such tasks much faster.
Other issue is, that the results are somehow suspicious. The format looks wrong, every number I have in the database has max 4 digits after the decimal point and I am only adding these values to zero, therefore if the result looks like this: 00103,68330,51670, I have serious doubts. Please, help me, if it is wrong, why is it so, and what can I do to correct it.
Neo4j claims that it is efficient and fast, therefore I presume that the fault is in my code (the performance of my computer is more than enough). Please, If you can, help me to shorten this time and explain the patterns needed to perform this.

A few observations that should help:
You have several errors in how you are importing. These errors will create many more nodes than you think, and create the "suspicious" issue you raised:
Your file has multiple rows with the same name, but your import is creating a new Module node every time. Therefore, you are ending up with multiple nodes for some of your modules. You should be using MERGE instead of CREATE.
Your mafter property needs to contain a collection of strings, not a single string.
You are importing the numeric values as strings, so code such as x + a.avgtime is just doing string concatenation, not numeric addition. Furthermore, even if you did attempt to convert your strings to numbers, that would fail because your numbers use a comma instead of a period to indicate the decimal place.
Try this for importing (into an empty DB):
LOAD CSV WITH HEADERS FROM "file:c:/users/Skelo/Desktop/Neo4J related/Statistic Dependencies/Simple.csv" AS row FIELDTERMINATOR ';'
MERGE (n:Module {name: row.name})
ON CREATE SET
n.mafter = [row.mafter],
n.avgtime = TOFLOAT(REPLACE(row.avgtime, ',', '.')),
n.medtime = TOFLOAT(REPLACE(row.medtime, ',', '.'))
ON MATCH SET
n.mafter = n.mafter + row.mafter;
You also need to change your current merge query so that you can handle an mafter that is a collection. Note that the following query is designed to NOT create any new nodes (even if a name in mafter does not yet have a module node).
MATCH (p:Module)
OPTIONAL MATCH (p)-[:PRECEEDS]->(z:Module)
WITH p, COLLECT(z.name) AS existing
WITH p, filter(x IN p.mafter
WHERE NOT x IN existing) AS todo
MATCH (q:Module)
WHERE q.name IN todo
MERGE (p)-[:PRECEEDS]->(q)
RETURN p, q;
You should create an index to speed up the matching of modules by name:
CREATE INDEX ON :Module(name)

Cypher does have a shortestPath function, see http://neo4j.com/docs/stable/query-match.html#_shortest_path. However this calculates the shortest path based on the number of hops and does not take a weight into account.
Neo4j has couple of graph algorithms on board, e.g. Dijekstra or AStar. Unfortunately these are not yet available via cypher. Instead you have two alternatives to use them:
1) write an unmanaged extension to Neo4j and use GraphAlgoFactory in the implmentation. This requires to write same java code and deploy it to the Neo4j server. Using a custom CostEvaluator you can use the avgTime property on your nodes as cost parameter.
2) use the REST API as documented on http://neo4j.com/docs/stable/rest-api-graph-algos.html#rest-api-execute-a-dijkstra-algorithm-and-get-a-single-path. This approach requires to have the weight as a property on the relationship and not on a node (like in your data model)

MATCH prior to MERGE in single Cypher query confuses ON MATCH and ON CREATE

I've run into this on Neo4j 2.1.5. I have a query which I'm issuing from Node.js using the Neo4j REST API. The point of this query is to be able to create or update a given Node and set its state (including labels and properties) to some known state. The MATCH and REMOVE clause prior to the WITH is to work around the fact that there's no direct way to remove all of a Node's labels nor is there a way to update a Node's labels with a given set of labels. You have to explicitly remove the labels you don't want and add the labels you do want. And there's no way to remove labels in the MERGE clause.
A somewhat simplified version of the query looks like:
MATCH (m {name:'Brian'})
REMOVE m:l1:l2
WITH m
MERGE (n {name:'Brian'})
ON MATCH SET n={mprops} ON CREATE SET n={cprops}
RETURN n
where mprops = {updated:true, created:false} and cprops = {updated:false, created:true}. I do this so that in a single Cypher query I can remove all of the Node's existing labels and set new labels using the ON MATCH clause. The problem is that including the initial MATCH seems to confuse the ON MATCH vs ON CREATE logic.
Assuming the Brian Node already exists, the result of this query should show that n.created = false and n.updated = true. However, I get the opposite result, n.created=true, n.updated=false. If I remove the initial MATCH (and WITH) clause and execute only the MERGE clause, the results are as expected. So somehow, the inclusion of the MATCH clause causes the MERGE clause to think that a CREATE vs MATCH is happening.
I realize this is a weird use of the WITH clause, but it did seem like it would work around the limitation in manipulating labels. And Cypher thinks that it's valid Cypher. I'm assuming this is just a bug and an edge case, but I wanted to get others insights and possible alternatives before I report it.
I realize that I could have created a transaction and issued the MATCH and MERGE as separate queries within that transaction, but there are reasons that this does not work well in the design of the API I'm writing.
Thanks!

If you prefix your query with MATCH it will never execute if there is no existing ('Brian') node.
You also override all properties with your SET n = {param} you should use SET n += {param}
MERGE (n:Label { name:'Brian' })
ON MATCH SET n += {create :false,update:true }
ON CREATE SET n += {create :true,update:false }
REMOVE n:WrongLabel
RETURN n

I don't see why your query would not work, but the issues brought up by #FrobberOfBits are valid.
However, logically, your example query is equivalent to this one:
MATCH (m {name:'Brian'})
REMOVE m:l1:l2
SET m={mprops}
RETURN m
This query is simpler, avoids the use of MERGE entirely, and may avoid whatever issue you are seeing. Does this represent what you were trying to do?

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart