Neo4j proper indexes for query

The following query, run against a large database, takes almost a minute:
MATCH (p1:Politician)-[r1:mentioned_by]->(c:Channel {name: "TN"})<-[r2:mentioned_by {video_id: r1.video_id}]-(p2:Politician)
WHERE p1.fullname < p2.fullname
WITH DISTINCT p1.fullname AS x, p2.fullname AS y
RETURN COUNT(*) AS rows
I have added an index on the name property of the Channel nodes, which saved at least 10 seconds, but all in all the query still takes about 50 seconds, which is a lot.
Any suggestions on which index to create?

One of the reasons your query takes so long may be the fact that you are comparing properties, which requires ‘opening’ many nodes.
Apparently you assume that fullname is not a unique identifier. Otherwise, I would create a uniqueness constraint on that property and compare p1 to p2 directly, as sketched below.
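If fullname can in fact be made unique, a minimal sketch (using the same constraint syntax that appears elsewhere on this page; comparing internal node ids deduplicates the pairs without reading any property):
CREATE CONSTRAINT ON (p:Politician) ASSERT p.fullname IS UNIQUE;
// compare the nodes themselves rather than their fullname properties
MATCH (p1:Politician)-[r1:mentioned_by]->(c:Channel {name: "TN"})<-[r2:mentioned_by {video_id: r1.video_id}]-(p2:Politician)
WHERE id(p1) < id(p2)
WITH DISTINCT p1, p2
RETURN COUNT(*) AS rows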

Related

Cypher to lookup and order by multiple values

I have a JSON document with history-based entity counts and relationship counts. I want to use this lookup data for entities and relationships in Neo4j. The lookup data has around 3,000 rows. For the entity counts, I want to display the counts for two entities based on UUID. For relationships, I want to order by two relationship counts (related entities and related mutual entities).
For entities, I have started with the following:
// get JSON doc
WITH value.aggregations.ent.terms.buckets AS data
UNWIND data AS lookup1
UNWIND data AS lookup2
MATCH (e1:Entity)-[r1:RELATED_TO]-(e2)
WHERE e1.uuid = '$entityId'
AND e1.uuid = lookup1.key
AND e2.uuid = lookup2.key
RETURN e1.uuid, lookup1.doc_count, r1.uuid, e2.uuid, lookup2.doc_count
ORDER BY lookup2.doc_count DESC // just to demonstrate
LIMIT 50
I'm noticing that the query takes about 10 seconds. What am I doing wrong, and how can I correct it?
Attaching the explain plan (image omitted).
Your query is very inefficient. You stated that data has 3,000 rows (let's call that number D).
So, your first UNWIND creates an intermediate result of D rows.
Your second UNWIND creates an intermediate result of D**2 (i.e., 9 million) rows.
If your MATCH (e1:Entity)-[r1:RELATED_TO]-(e2) clause finds N results, that generates an intermediate result of up to N*(D**2) rows.
Since your MATCH clause specifies a non-directional relationship pattern, it finds the same pair of nodes twice (in reverse order). So, N is actually twice as large as it needs to be.
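Also note that your WHERE clause quotes the parameter ('$entityId'), so e1.uuid is compared against the literal string $entityId rather than against the parameter's value; the improved query below references the parameter unquoted, as $entityId.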
Here is an improved version of your query, which should be much faster (with N/2 intermediate rows):
WITH apoc.map.groupBy(value.aggregations.ent.terms.buckets, 'key') as lookup
MATCH (e1:Entity)-[r1:RELATED_TO]->(e2)
WHERE e1.uuid = $entityId AND lookup[e1.uuid] IS NOT NULL AND lookup[e2.uuid] IS NOT NULL
RETURN e1.uuid, lookup[e1.uuid].doc_count AS count1, r1.uuid, e2.uuid, lookup[e2.uuid].doc_count AS count2
ORDER BY count2 DESC
LIMIT 50
The trick here is that the query uses apoc.map.groupBy to convert your buckets (a list of maps) into a single unified lookup map that uses the bucket key values as its property names. This allows the rest of the query to literally "look up" each uuid's data in the unified map.
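For illustration, here is the transformation on a tiny hypothetical bucket list whose key/doc_count shape mirrors the question's data:
RETURN apoc.map.groupBy([{key: 'a1', doc_count: 42}, {key: 'b2', doc_count: 7}], 'key') AS lookup
// lookup is now {a1: {key: 'a1', doc_count: 42}, b2: {key: 'b2', doc_count: 7}},
// so lookup['a1'].doc_count returns 42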

Filter Relationships in Neo4j Using Start/End Dates

I have a graph model -
(p:Person)-[r:LINK {startDate: timestamp, endDate: timestamp}]->(c:Company)
A person can be linked to multiple companies at the same time and a company can have multiple people linking to it at the same time (i.e. there is a many-to-many relationship between companies and people).
The endDate property is optional and will only be present when a person has left a company.
I am trying to display a network of connections and can successfully return all related nodes from a person using the following Cypher query (this will display 2 levels of people connections):
MATCH (p:Person {id:<id>})-[r:LINK*0..4]-(l) RETURN *
What I now need to do is filter the relationships so that they match on timeframe, e.g. Person 1 worked at Company A between 01/01/2000 and 31/12/2002. Person 2 worked at Company A between 01/01/2001 and 30/06/2001. Person 3 started at Company A on 01/01/2005 and is still there. The results for Person 1 should include Person 2 but not Person 3.
This same logic needs to be applied to all levels of the graph (we allow the user to display 3 levels of connections) and relates to the parent node in each level, i.e. when displaying level 2, the dates for Person 2 and Person 3 should be used to filter their respective relationships.
Essentially, we are trying to do something similar to the LinkedIn connections but to filter based on people working at companies at the same time.
I have tried using the REDUCE function but cannot get the logic to work for the optional end date - can someone please advise how to filter the relationships based on the start and end dates?
It turns out there are 4 ways in which two date ranges can overlap, but only 2 in which they do not (person 1 ends before person 2 starts, or person 2 ends before person 1 starts), so it is much simpler to check that neither of these no-overlap conditions holds.
In the level 1 case, this query should do the trick:
MATCH (start:Person{id:1})-[r1:LINK]->(c)<-[r2:LINK]-(suggest)
WHERE NOT ((r1.endDate IS NOT NULL and r1.endDate < r2.startDate)
OR (r2.endDate IS NOT NULL and r2.endDate < r1.startDate))
RETURN suggest
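Applying this to the question's data: Person 1's r1.endDate (31/12/2002) is before Person 3's r2.startDate (01/01/2005), so that pair hits the first no-overlap condition and Person 3 is excluded, while Person 2's range falls entirely within Person 1's and passes.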
The tricky part is applying this to multiple levels.
While we could create a single Cypher query to handle this dynamically, the evaluation of the relationships would only happen after expansion, not during, so it may not be the most efficient:
MATCH path = (start:Person{id:1})-[:LINK*..6]-(suggest:Person)
WITH path, start, suggest, apoc.coll.pairsMin(relationships(path)) as pairs
WITH path, start, suggest, [index in range(0, size(pairs)-1) WHERE index % 2 = 0 | pairs[index]] as pairs
WHERE none(pair in pairs WHERE (pair[0].endDate IS NOT NULL AND pair[0].endDate < pair[1].startDate)
OR (pair[1].endDate IS NOT NULL AND pair[1].endDate < pair[0].startDate))
RETURN suggest
Some of the highlights here...
We're using apoc.coll.pairsMin() from APOC Procedures to get pairs of adjacent relationships from the collection of relationships in each path. We're only interested in the even-numbered entries (the two relationships from people working at the same company), because the odd-numbered pairs correspond to relationships from the same person going out to two different companies.
So if we were executing on this pattern:
MATCH path = (start:Person)-[r1:LINK]->(c1)<-[r2:LINK]-(person2)-[r3:LINK]->(c2)<-[r4:LINK]-(person3)
The apoc.coll.pairsMin(relationships(path)) would return [[r1, r2], [r2,r3], [r3,r4]], and as you can see the relationships we need to consider are the ones linking 2 people to a company, so indexes 0 and 2 in the pairs list.
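To see the pairing behavior in isolation (assuming APOC is installed), a quick check:
RETURN apoc.coll.pairsMin([1, 2, 3, 4]) AS pairs
// returns [[1, 2], [2, 3], [3, 4]]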
After we get our pairs, we need to ensure that all of the interesting relationship pairs along the path to a suggestion meet your criteria and overlap (or, more precisely, do not fail to overlap).
Something like this should work:
MATCH path = (p:Person {id: $id})-[:LINK*..4]-(l)
WITH path, RELATIONSHIPS(path)[0] AS r0
WHERE ALL(x IN RELATIONSHIPS(path)[1..] WHERE x.startDate <= r0.endDate AND x.endDate >= r0.startDate)
RETURN path;
Assumptions:
The id value of the main person of interest is provided by the $id parameter.
You want the variable-length relationship pattern to have a lower bound of 1 (which is the default). If you used 0 as the lower bound, then you would also get the main person of interest as a result, which is probably not what you want.
startDate and endDate have non-null values that are suitable for comparison using the comparison operators.
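Since the question notes that endDate is optional, the ALL predicate above can be made null-safe by substituting a far-future sentinel for a missing endDate (a sketch, assuming numeric epoch-millisecond timestamps):
WHERE ALL(x IN RELATIONSHIPS(path)[1..]
  WHERE x.startDate <= COALESCE(r0.endDate, 9223372036854775807)  // max 64-bit integer stands in for "still employed"
  AND COALESCE(x.endDate, 9223372036854775807) >= r0.startDate)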

Check if two nodes have a relationship in constant time

Currently I have a unique index on nodes with the label ReferenceEntity. It takes approximately 11 seconds for this query to run, returning 7 rows. Granted, T1 has about 400,000 relationships.
I'm not sure why this would take so long, considering we could build a map of all nodes connected to T1, thus giving constant-time lookups.
Am I missing some other index features that Neo4j can provide? Also, my entire dataset is in memory, so it shouldn't have anything to do with going to disk.
match(n:ReferenceEntity {entityId : "T1" })-[r:HAS_REL]-(d:ReferenceEntity) WHERE d.entityId in ["T2", "T3", "T4"] return n
:schema
Indexes
ON :ReferenceEntity(entityId) ONLINE (for uniqueness constraint)
Constraints
ON (referenceentity:ReferenceEntity) ASSERT referenceentity.entityId IS UNIQUE
Explain plan (image omitted).
You had used EXPLAIN instead of PROFILE to get that query plan, so it shows misleading estimated row counts. If you had used PROFILE, the Expand(All) operation would have shown about 400,000 rows, since that operation actually iterates through every relationship. That is why your query takes so long.
You can try this query, which tells Cypher to use the indexes on both n and d. (On my machine, I had to use the USING INDEX clause twice to get the desired results.) It definitely pays to use PROFILE to tune Cypher code.
MATCH (n:ReferenceEntity { entityId : "T1" })
USING INDEX n:ReferenceEntity(entityId)
MATCH (n)-[r:HAS_REL]-(d:ReferenceEntity)
USING INDEX d:ReferenceEntity(entityId)
WHERE d.entityId IN ["T2", "T3", "T4"]
RETURN n, d;
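With both hints in place, the planner does an index seek on each end of the pattern and then just checks for connecting relationships between the two small node sets (an Expand(Into)/hash-join style plan), instead of seeking only n and iterating over all ~400,000 of its relationships.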
Here is the profile plan (in my DB, I had 2 relationships that satisfied the WHERE test; image omitted).

Neo4j relate nodes by same property

I have a Neo4J DB up and running with currently 2 Labels: Company and Person.
Each Company Node has a Property called old_id.
Each Person Node has a Property called company.
Now I want to establish a relation between each Company and each Person where old_id and company share the same value.
I have already tried the suggestions from two related questions, both titled "Find Nodes with the same properties in Neo4J".
Following the first link, I tried:
MATCH (p:Person)
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
resulting in no change at all. As suggested by the second link, I tried:
START
p=node(*), c=node(*)
WHERE
HAS(p.company) AND HAS(c.old_id) AND p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN p, c;
resulting in a runtime of more than 36 hours, at which point I had to abort the command without knowing whether it would eventually have worked. Therefore, I'd like to ask whether it is theoretically correct and I am just impatient (the dataset is quite big, to be honest), or whether there is a more efficient way of doing it.
A simple test console shows that your original query works as expected, assuming:
Your stated data model is correct
Your data actually has Person and Company nodes with matching company and old_id values, respectively.
Note that, in order to match, the values must be of the same type (e.g., both are strings, or both are integers, etc.).
So, check that #1 and #2 are true.
Depending on the size of your dataset, you may want to page through it. First, create the constraint:
CREATE CONSTRAINT ON (c:Company) ASSERT c.old_id IS UNIQUE;
MATCH (p:Person)
WITH p SKIP 100000 LIMIT 100000
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN count(*);
Just increase the SKIP value from zero up to your total number of people, in 100k steps.
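If APOC is available, an alternative to hand-rolled SKIP/LIMIT paging is to let apoc.periodic.iterate do the batching (a sketch; the batch size is arbitrary):
CALL apoc.periodic.iterate(
  "MATCH (p:Person) MATCH (c:Company) WHERE p.company = c.old_id RETURN p, c",
  "CREATE (p)-[:BELONGS_TO]->(c)",
  {batchSize: 100000});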

Neo4j indexing for large number of nodes

I am learning the basics of Neo4j and am looking at the following credit card fraud example: https://linkurio.us/stolen-credit-cards-and-fraud-detection-with-neo4j. The Cypher query that finds stores where all compromised users shopped is:
MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
WHERE r.status = "Disputed"
MATCH (victim)-[t:HAS_BOUGHT_AT]->(othermerchants)
WHERE t.status = "Undisputed" AND t.time < r.time
WITH victim, othermerchants, t ORDER BY t.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT t) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DESC
However, when the number of users increases (let's say to millions), this query may become slow, since the initial MATCH has to traverse all nodes labeled person. Is it possible to speed up the query by assigning properties to nodes instead of relationships? I tried removing the "status" property from the relationships and adding it to the nodes (users, not merchants). However, when I run the query with the constraint WHERE victim.status = "Disputed", it doesn't return anything. So in my case a person has one additional property, status. I assume I did a lot of things wrong, but I would appreciate comments. For example,
MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
WHERE victim.status = "Disputed"
returns the correct number of disputed transactions. The same holds for separately querying the number of undisputed transactions. However, when merged, they yield an empty set.
If I made a mistake in my approach, how can I speed up queries over a large number of nodes (i.e., avoid traversing all nodes in the first step)? I will be working with a data set with similar properties but around 100 million users, so I would like to index users on additional properties.
[Edited]
Moving the status property from the relationship to the person node does not seem to be the right approach, since I presume the same person can be a customer of multiple merchants.
Instead, you can reify the relationship as a node (let's label it purchase), as in:
(:person)-[:HAS_PURCHASE]->(:purchase)-[:BOUGHT_AT]->(merchant)
The purchase nodes can have the status property. You just have to create the index:
CREATE INDEX ON :purchase(status)
Also, you can put the time property in the new purchase nodes.
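To move existing data into the reified shape, a one-off migration along these lines could work (a sketch; for a large dataset you would batch it, e.g. with the paging or apoc.periodic.iterate techniques shown earlier on this page):
MATCH (p:person)-[r:HAS_BOUGHT_AT]->(m)
CREATE (p)-[:HAS_PURCHASE]->(pu:purchase {status: r.status, time: r.time})-[:BOUGHT_AT]->(m)
DELETE r;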
With the above, your query would become:
MATCH (victim:person)-[:HAS_PURCHASE]->(pd:purchase)-[:BOUGHT_AT]->(merchant)
WHERE pd.status = "Disputed"
MATCH (victim)-[:HAS_PURCHASE]->(pu:purchase)-[:BOUGHT_AT]->(othermerchants)
WHERE pu.status = "Undisputed" AND pu.time < pd.time
WITH victim, othermerchants, pu ORDER BY pu.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT pu) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DESC
