Neo4j cypher query for connected components is too slow - neo4j

I am looking for help optimizing the following Cypher query.
CALL algo.unionFind.stream()
YIELD nodeId,setId
MATCH(n) where ID(n) = nodeId AND NOT (n)-[:IS_CHILD_OF]-()
call apoc.create.uuids(1) YIELD uuid
WITH n as nod, uuid, setId
WHERE nod is not null
MERGE(groupid:GroupId {group_id:'id_'+toString(setId)})
ON CREATE set groupid.group_value = uuid, groupid.updated_at = '1512135348335'
MERGE(nod)-[:IS_CHILD_OF]->(groupid)
RETURN count(nod);
I have already applied a unique constraint and an index on group_id, and I am running on a well-provisioned machine (i3-2xl).
Even so, the query takes far too long: about 22 minutes for roughly 500k nodes.
This is what I want the query to achieve:
Get all the connected components (sub-graphs).
Create a new node for each group (connected component).
Assign a uuid as the value of the new group node.
Create a relationship from every group member to the new group node.
Any suggestions for optimizing the query are welcome, as are other ways to meet these requirements.
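One possible restructuring, offered as a sketch rather than a measured fix: collect the members of each component first, so that each GroupId node is merged once per component instead of once per member, and a uuid is generated only when a group is actually created. It assumes the same procedure call as the question and the APOC function apoc.create.uuid(). It leaves out the question's NOT (n)-[:IS_CHILD_OF]-() filter, which would need to be added back for incremental runs.

```cypher
// Sketch only: same labels, properties and procedure call as in the question.
CALL algo.unionFind.stream()
YIELD nodeId, setId
// One row per component instead of one row per node
WITH setId, collect(nodeId) AS members
MERGE (g:GroupId {group_id: 'id_' + toString(setId)})
ON CREATE SET g.group_value = apoc.create.uuid(),
              g.updated_at = '1512135348335'
WITH g, members
UNWIND members AS nodeId
MATCH (n) WHERE id(n) = nodeId
MERGE (n)-[:IS_CHILD_OF]->(g)
RETURN count(n);
```

Whether this is faster on a given dataset is best checked with PROFILE on a small sample.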

Related

Neo4J Cypher Query on 5M nodes

This is the basic query I am trying:
MATCH (b1:Business),(b2:Business) WHERE ID(b1)<>ID(b2) AND b1.name[0]=b2.name[0]
WITH b1,b2,apoc.create.uuid() as uuid
MERGE (b1)-[d:MCC_NAME]->(b2)
ON CREATE
SET d.m_score = 100
SET d.m_event = uuid
SET d.m_dt = datetime()
RETURN count (d)
I have also tried to separate the query and run through apoc.periodic.iterate() but in either case the query runs forever and never yields results. The name property is an array but at present there are only single entries in it, so I tried to simplify by using simple comparison of name[0], but it didn't help. The database is fairly large, about 5 million nodes. Any advice appreciated.
I would do this:
MATCH (b1:Business)
WITH b1
MATCH(b2:Business) WHERE b1.name[0]=b2.name[0] AND b1<>b2
WITH b1,b2,apoc.create.uuid() as uuid
MERGE (b1)-[d:MCC_NAME]->(b2)
SET …
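Make sure you have an index on name. Bear in mind that such an index covers the whole property value, so a comparison on name[0] may not be able to use it; check the plan with EXPLAIN.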
If you do not want relationships in both directions between each pair, you could add
WHERE id(b1)>id(b2)
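Since the question mentions apoc.periodic.iterate(), here is a minimal sketch of how the suggestion above might be batched. The batch size of 1000 is an arbitrary assumption, and all three properties are set inside ON CREATE, unlike the original query, where only the first SET belongs to ON CREATE.

```cypher
// Sketch only: batchSize is a guess and should be tuned for the machine.
CALL apoc.periodic.iterate(
  "MATCH (b1:Business) RETURN b1",
  "MATCH (b2:Business)
   WHERE b2.name[0] = b1.name[0] AND id(b1) > id(b2)
   MERGE (b1)-[d:MCC_NAME]->(b2)
   ON CREATE SET d.m_score = 100,
                 d.m_event = apoc.create.uuid(),
                 d.m_dt = datetime()",
  {batchSize: 1000, parallel: false}
)
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;
```

Batching keeps each transaction small, but it does not reduce the number of candidate pairs: names shared by many businesses still produce a quadratic number of relationships.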

Is matching with id performant in Neo4J?

I'm wondering, when I have read the data of a node and I want to match it in another query, which way will have the best performance? Using id like this:
MATCH (n) WHERE ID(n) = 1234
or using indices of the node:
MATCH (n:Label {SomeIndexProperty: 3456})
Which one is better?
The ID is a technical identifier internal to Neo4j and should not be used as a primary key for your application.
Every node (and relationship) has a technical ID, and it is stable over time.
But if you delete a node, for example node 32, Neo4j can reuse this ID for a new node.
So you can safely use it in queries inside the same transaction; beyond that, you should know what you are doing.
The only way to retrieve the technical ID is to use the ID function, as you do in your first query: MATCH (n) WHERE ID(n) = 1234 RETURN n.
The ID is not exposed as a node's property, so you can't do MATCH (n {ID:1234}) RETURN n.
As you have noticed, when the WHERE clause is a strict equality, you can put the condition directly on the node.
For example:
MATCH (n:Node) WHERE n.name = 'logisima' RETURN n
MATCH (n:Node {name:'logisima'}) RETURN n
Those two queries are identical: they generate the same query plan, and the second form is just syntactic sugar.
Is it faster to retrieve a node by its ID or by an indexed property?
The easiest way to answer this question is to profile the two queries.
For the one based on the ID, you will see a NodeByIdSeek operator that costs 1 db hit, and for the one with a unique constraint you will see a NodeUniqueIndexSeek operator with 2 db hits.
So searching a node by its ID is faster.
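To reproduce the comparison, the two queries from the question can be prefixed with PROFILE. The label and property names are the placeholders used in the question, and the second query is assumed to have a unique constraint on that property.

```cypher
// Lookup by internal ID: expect a NodeByIdSeek operator in the plan
PROFILE MATCH (n) WHERE id(n) = 1234 RETURN n;

// Lookup by constrained property: expect a NodeUniqueIndexSeek operator
PROFILE MATCH (n:Label {SomeIndexProperty: 3456}) RETURN n;
```

The db hit counts reported in each plan are what the comparison above is based on.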

querying a DB for an unknown element with a given uuid

Many years ago, I discussed with some Neo4j engineers the ability to query for an unknown object given its uuid.
At that time, the answer was that there was no general db index in neo4j.
Now, I have the same problem to solve:
Each node I create has a unique id (a uuid in the form <ns:u-<uuid>-?v=n>, where ns is the namespace, uuid is a unique uuid and v=n is the version number of the element).
I'd like to have the ability to run the following cypher query:
match (n) where n.about = 'ki:u-SSD-v5.0?v=2' return n;
which actually returns nothing.
The following query
match (n:`mm:ontology`) where n.about = 'ki:u-SSD-v5.0?v=2' return n;
returns what I need, despite the fact that at query time I don't know the element type.
Can anyone help on this?
Paolo
Have you considered adding a schema index on the about attribute that covers every node in the database?
For instance
Add a global label to all nodes in the graph (e.g. Node) that do not already have it. If your graph is overly large and/or heap overly small you will need to batch this operation. Something along the lines of the following...
MATCH (n)
WHERE NOT n:Node
WITH n
LIMIT 100000
SET n:Node
After the label is added, create a uniqueness constraint on the about attribute for your new global label (e.g. Node); the constraint is backed by an index. These two steps can be performed in either order.
CREATE CONSTRAINT ON (node:Node) assert node.about IS UNIQUE
Then querying with something like the following
MATCH (n:Node)
WHERE n.about = 'ki:u-SSD-v5.0?v=2'
RETURN n;
will return the node you are seeking in a performant manner.
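If APOC is installed, the batched labelling step described above could also be driven by apoc.periodic.iterate() rather than by re-running the LIMIT query by hand. This is a sketch under that assumption; the batch size simply mirrors the 100000 used above.

```cypher
// Sketch only: adds the global label in batches, one transaction per batch.
CALL apoc.periodic.iterate(
  "MATCH (n) WHERE NOT n:Node RETURN n",
  "SET n:Node",
  {batchSize: 100000}
)
YIELD batches, total
RETURN batches, total;
```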

Neo4j relate nodes by same property

I have a Neo4J DB up and running with currently 2 Labels: Company and Person.
Each Company Node has a Property called old_id.
Each Person Node has a Property called company.
Now I want to establish a relation between each Company and each Person where old_id and company share the same value.
I have already tried the suggestions from: Find Nodes with the same properties in Neo4J and
Find Nodes with the same properties in Neo4J
Following the first link, I tried:
MATCH (p:Person)
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
This resulted in no change at all. As suggested by the second link, I then tried:
START
p=node(*), c=node(*)
WHERE
HAS(p.company) AND HAS(c.old_id) AND p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN p, c;
This resulted in a runtime of more than 36 hours, and I had to abort the command without knowing whether it would eventually have worked. Therefore I'd like to ask whether it is theoretically correct and I'm just impatient (the dataset is quite big, to be honest), or whether there is a more efficient way of doing it.
This simple console shows that your original query works as expected, assuming:
1. Your stated data model is correct.
2. Your data actually has Person and Company nodes with matching company and old_id values, respectively.
Note that, in order to match, the values must be of the same type (e.g., both are strings, or both are integers, etc.).
So, check that #1 and #2 are true.
Depending on the size of your dataset, you may want to page through it:
create constraint on (c:Company) assert c.old_id is unique;
MATCH (p:Person)
WITH p SKIP 100000 LIMIT 100000
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN count(*);
Just increase the skip value from zero to your total number of people in 100k steps.
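A parameterised variant of the paged query, as a sketch: $skip is a query parameter supplied by the client for each page (0, 100000, 200000, ...). Two details are assumptions added here and not part of the answer above: ORDER BY id(p) makes the pages deterministic between runs, and MERGE is swapped in for CREATE so that re-running a page does not duplicate relationships.

```cypher
// Sketch only: run once per page, increasing $skip by 100000 each time.
MATCH (p:Person)
WITH p ORDER BY id(p) SKIP $skip LIMIT 100000
MATCH (c:Company) WHERE c.old_id = p.company
MERGE (p)-[:BELONGS_TO]->(c)
RETURN count(*);
```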

Neo4j indexing for large number of nodes

I am learning the basics of neo4j and I am looking at the following example with credit card fraud https://linkurio.us/stolen-credit-cards-and-fraud-detection-with-neo4j. The Cypher query that finds the stores where all compromised users shopped is:
MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
WHERE r.status = "Disputed"
MATCH (victim)-[t:HAS_BOUGHT_AT]->(othermerchants)
WHERE t.status = "Undisputed" AND t.time < r.time
WITH victim, othermerchants, t ORDER BY t.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT t) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DESC
However, when the number of users increases (let's say to millions of users), this query may become slow, since the initial match will have to traverse all nodes labeled person. Is it possible to speed up the query by assigning properties to nodes instead of transactions? I tried removing the "status" property from the relationships and adding it to the nodes (users, not merchants). However, when I run the query with the constraint WHERE victim.status="Disputed", it doesn't return anything. So, in my case a person has one additional property, 'status'. I assume I did a lot of things wrong, but I would appreciate comments. For example
MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
WHERE victim.status = "Disputed"
returns the correct number of disputed transactions. The same holds for separately querying the number of undisputed transactions. However, when merged, they yield an empty set.
If I made a mistake in my approach, how can I speed up queries for a large number of nodes (avoid traversing all nodes in the first step)? I will be working with a data set with similar properties, but it will have around 100 million users, so I would like to index users on additional properties.
[Edited]
Moving the status property from the relationship to the person node does not seem to be the right approach, since I presume the same person can be a customer of multiple merchants.
Instead, you can reify the relationship as a node (let's label it purchase), as in:
(:person)-[:HAS_PURCHASE]->(:purchase)-[:BOUGHT_AT]->(merchant)
The purchase nodes can have the status property. You just have to create the index:
CREATE INDEX ON :purchase(status)
Also, you can put the time property in the new purchase nodes.
With the above, your query would become:
MATCH (victim:person)-[:HAS_PURCHASE]->(pd:purchase)-[:BOUGHT_AT]->(merchant)
WHERE pd.status = "Disputed"
MATCH (victim)-[:HAS_PURCHASE]->(pu:purchase)-[:BOUGHT_AT]->(othermerchants)
WHERE pu.status = "Undisputed" AND pu.time < pd.time
WITH victim, othermerchants, pu ORDER BY pu.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT pu) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DESC
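For existing data, the reified model could be populated from the current relationships along these lines. This is a sketch that assumes each HAS_BOUGHT_AT relationship carries only the status and time properties mentioned in the question; any other properties would need to be copied as well.

```cypher
// Sketch only: turn each HAS_BOUGHT_AT relationship into a purchase node.
MATCH (p:person)-[r:HAS_BOUGHT_AT]->(m)
CREATE (p)-[:HAS_PURCHASE]->(pu:purchase {status: r.status, time: r.time})
CREATE (pu)-[:BOUGHT_AT]->(m)
DELETE r;
```

On a graph with many millions of relationships, this would likely need to be run in batches and not as a single transaction.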

Resources