Why is my Neo4J composite index not being used with MATCH and ORDER BY?

I have a lot of nodes of label Person with properties treeId, firstName, lastName.
I am trying to implement a performant endless scroll of all Persons with some treeId, ordered alphabetically:
MATCH (p:Person {treeId: "admin"}) RETURN p ORDER BY p.lastName, p.firstName SKIP 100 LIMIT 20
Question: What index do I need to create for this operation to run on indexes as much as possible?
I attempted to create such an index:
CREATE INDEX personTreeLastNameFirstName FOR (p:Person) ON (p.treeId, p.lastName, p.firstName)
but with this index, the first operation is NodeByLabelScan, so the index is not used.
Another index I tried is more helpful:
CREATE INDEX personTree FOR (p:Person) ON (p.treeId)
the first operation is NodeIndexSeek when using it, but it doesn't include the names, so every Person with the specified treeId needs to be read from the database.
What index do I need to create, or how do I need to rewrite the query for it to be more performant on large amounts of Persons with the same treeId?

The index:
CREATE INDEX personTree FOR (p:Person) ON (p.treeId)
only indexes treeId, so it can only be used to search and sort on treeId.
The composite index:
CREATE INDEX personTreeLastNameFirstName FOR (p:Person) ON (p.treeId, p.lastName, p.firstName)
indexes treeId, lastName and firstName, but the catch is that it will only be used if all three indexed properties appear in the search clause. Your query filters only on treeId, which is why you are getting NodeByLabelScan. To allow Neo4j to use your composite index, add search criteria for firstName and lastName as well (note that > "" excludes Persons whose firstName or lastName is missing or an empty string). Like this:
MATCH (p:Person)
WHERE p.treeId = "admin" AND p.firstName > "" AND p.lastName > ""
RETURN p
ORDER BY p.lastName, p.firstName
SKIP 100 LIMIT 20

Related

Create relationship between each two nodes for a set of nodes

I have created many nodes in neo4j, the attributes of these nodes are the same, they all have user_id and item_id, the code used is as follows:
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row
CREATE (main:Main_table {USER_ID: row.user_id,
                         ITEM_ID: row.item_id});
CREATE INDEX ON :Main_table(USER_ID);
CREATE INDEX ON :Main_table(ITEM_ID);
Now I want to create relationship between the nodes with the same user_id or item_id. For example, if node A, B and C have the same USER_ID, I want to create (A)-[:EDGE]->(B), (A)-[:EDGE]->(C) and (B)-[:EDGE]->(C). In order to achieve this goal, I tried the following code:
MATCH (a:Main_table),(b:Main_table)
WHERE a.USER_ID = b.USER_ID
CREATE (a)-[:USER_EDGE]->(b);
MATCH (a:Main_table),(b:Main_table)
WHERE a.ITEM_ID = b.ITEM_ID
CREATE (a)-[:ITEM_EDGE]->(b);
But due to the large amount of data (3,000,000 nodes, 100,000 users), this process is very slow. How can I complete it quickly? Any help would be greatly appreciated!
Your query is causing a cartesian product, and the Cypher planner does not use indexes to optimize node lookups involving node property comparisons.
A query like this (instead of your USER_EDGE query) may be faster, as it does not cause a cartesian product:
MATCH (a:Main_table)
WITH a.USER_ID AS id, COLLECT(a) AS mains
UNWIND mains AS a
UNWIND mains AS b
WITH a, b
WHERE ID(a) < ID(b)
MERGE (a)-[:USER_EDGE]->(b)
This query uses the aggregating function COLLECT to collect the nodes that have the same USER_ID value, and uses the ID(a) < ID(b) test to ensure that a and b are not the same nodes and to also prevent duplicate relationships (in opposite directions).
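To make the difference from the cartesian product concrete, here is a minimal Python sketch (not Cypher) of the same idea: bucket the nodes by USER_ID first, then pair nodes only inside each bucket, keeping the pair where the first id is smaller. The sample data and the function name pairs_by_key are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def pairs_by_key(nodes):
    """nodes: iterable of (node_id, user_id) tuples.

    Returns (a, b) pairs with a < b for nodes sharing a user_id,
    mirroring COLLECT + double UNWIND + ID(a) < ID(b).
    """
    groups = defaultdict(list)
    for node_id, user_id in nodes:
        groups[user_id].append(node_id)  # like COLLECT(a) per USER_ID

    edges = []
    for ids in groups.values():
        # combinations over sorted ids yields each unordered pair once
        edges.extend(combinations(sorted(ids), 2))
    return edges

# Hypothetical sample: three nodes for u1, two for u2, one for u3
nodes = [(1, "u1"), (2, "u1"), (3, "u1"), (4, "u2"), (5, "u2"), (6, "u3")]
edges = pairs_by_key(nodes)
```

Pairs are only ever formed inside a group, so the work grows with the sum of the squared group sizes rather than with the square of the total node count.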

Preserving a query result for the duration of the query in Neo4j Cypher

I am using Neo4j 3.5.2 Desktop with Node.js. I am trying to update a user record's properties and add/remove relationships with other nodes in the same query.
My query looks like this:
MATCH (user:Dealer {email: $paramObj.email})
SET user += apoc.map.clean($paramObj, ["email","vehicles"],[])
WITH user, $paramObj.vehicles AS vehicles
UNWIND vehicles AS vehicle
MATCH(v:Vehicles {name:vehicle})
MERGE (user)-[r:SUPPLY_PARTS_FOR]->(v)
ON CREATE SET r.since = timestamp()
WITH vehicles,user
MATCH (user)-[r:SUPPLY_PARTS_FOR]->(v)
WHERE NOT apoc.coll.contains(vehicles,v.name)
DELETE r
WITH $paramObj.email AS dealeremail
MATCH (user:Dealer {email: dealeremail})
RETURN user
The issue I am having is that an empty 'user' array is returned when the part of the query that deletes a vehicle relationship (r) matches zero rows.
How do I preserve the original 'user' result, or save the email address so I can redo the lookup? I tried using WITH $paramObj.email AS dealerEmail, but it seems that I cannot carry dealerEmail forward, although I thought I could.
The problem is caused by the MATCH returning zero rows: once a clause produces no rows, nothing after it has anything to work on. It then dawned on me that OPTIONAL MATCH also finds nothing, but returns a single row with null values instead. So I changed the MATCH that searches for a relationship to delete into an OPTIONAL MATCH:
MATCH (user:Dealer {email: $paramObj.email})
SET user += apoc.map.clean($paramObj, ["email","vehicles"],[])
WITH user, $paramObj.vehicles AS vehicles
UNWIND vehicles AS vehicle
MATCH(v:Vehicles {name:vehicle})
MERGE (user)-[r:SUPPLY_PARTS_FOR]->(v)
ON CREATE SET r.since = timestamp()
WITH vehicles,user
OPTIONAL MATCH (user)-[r:SUPPLY_PARTS_FOR]->(v)
WHERE NOT apoc.coll.contains(vehicles,v.name)
DELETE r
RETURN user
This did the trick.
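The row behaviour behind this fix can be sketched in a few lines of Python (not Cypher). The helper names match, optional_match and stale_relationships, and the email address, are hypothetical; they only model how rows flow from one clause to the next.

```python
def match(rows, expand):
    """Like MATCH: a row with no matches disappears from the result."""
    return [(row, m) for row in rows for m in expand(row)]

def optional_match(rows, expand):
    """Like OPTIONAL MATCH: a row with no matches is kept once, paired with None."""
    out = []
    for row in rows:
        matches = list(expand(row))
        if matches:
            out.extend((row, m) for m in matches)
        else:
            out.append((row, None))
    return out

def stale_relationships(user):
    # Hypothetical case: this user has no relationship left to delete
    return []

rows = ["dealer@example.com"]
after_match = match(rows, stale_relationships)            # no rows left
after_optional = optional_match(rows, stale_relationships)  # user row survives
```

With the plain match the user row is gone, so a later RETURN has nothing to return. With the optional variant the row survives with None in place of the relationship.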

Neo4j how to create a property set, with out duplicate values unlike a List

I have a :hobby relation between :User nodes.
The hobby relation contains a List property { hobbies:['football','hockey'] }
Now I am iterating through a data stream and I want to merge the hobbies into this List uniquely (like a set). I tried using coalesce like this:
MERGE (from)-[rel:hobbies]->(to)
set rel.hobbies= COALESCE(rel.hobbies, []) + 'football';
The problem is that my property now contains duplicates:
{ hobbies:['football','hockey','football'] }
How can I avoid duplicates?
This Cypher query should work without need of APOC procedures.
MERGE (from)-[rel:hobbies]->(to)
WITH rel, COALESCE(rel.hobbies, []) + 'football' AS hobbies
UNWIND hobbies as r
WITH rel, collect(distinct r) AS unique
set rel.hobbies = unique
This query uses UNWIND to expand the hobbies list and then collects the distinct values into a variable called unique. Use it if you don't have APOC procedures installed on your Neo4j server.
[UPDATED]
This query will add 'football' to the hobbies collection only if it does not already exist (by doing a check first):
MERGE (from)-[rel:hobbies]->(to)
FOREACH(x IN CASE WHEN NOT ('football' IN COALESCE(rel.hobbies, [])) THEN [1] ELSE [] END |
SET rel.hobbies = COALESCE(rel.hobbies, []) + 'football')
Instead of hardcoding the hobby to add (e.g., 'football'), you should use a parameter.
Also, you should consider altering your data model to use Hobby nodes to represent the different hobbies, which is a more graph-oriented approach.
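For reference, the check-before-append idea can be written as a short Python sketch (not Cypher), including the case where the property does not exist yet. The function name add_unique is made up for illustration.

```python
def add_unique(hobbies, hobby):
    """Return the list with hobby appended only if it is not already present.

    hobbies may be None, standing in for a missing property
    (the role COALESCE(rel.hobbies, []) plays in Cypher).
    """
    current = hobbies if hobbies is not None else []
    if hobby in current:
        return current
    return current + [hobby]

# Usage: existing order is preserved and nothing is duplicated
result = add_unique(["football", "hockey"], "football")
```

Unlike collecting distinct values after the fact, this keeps the original order of the list and does nothing at all when the value is already there.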

Create relationships from sequence of events

I have a CSV with a log of events that has the following columns: EventType, UserId, RecordId (an auto-incremented sequence number). I want to import it into Neo4j, build a node for every EventType (around 100 unique types) and then analyze paths using relationships. To build the relationships I need to go through the raw events and find the "next" event in each path, that is, the event with the same UserId whose RecordId is larger than the current RecordId (next RecordId > current RecordId).
What is the efficient way to do this in Cypher? Somehow I come up with queries that involve a Cartesian product, which are very slow.
I think you cannot avoid Cartesian products in this case. However, you can make them as small as possible and use indexing to improve the speed of your queries.
Besides using the EventType as a node label ("unique type"), I strongly recommend adding an Event label to all events so that you can index the userId and recordId values.
CREATE INDEX ON :Event(recordId)
CREATE INDEX ON :Event(userId)
I created a small example data set:
CREATE
(e1:Event:Skating {userId: 1, recordId: 1}),
(e2:Event:Hiking {userId: 1, recordId: 2}),
(e3:Event:Mountaineering {userId: 1, recordId: 3})
To get the next recordId, you need nextRecordId > currentRecordId, and nextRecordId must also be the smallest such value (as the recordId comes from an auto-incremented sequence). We then connect the two events using MERGE (CREATE also works, but MERGE makes sure that we avoid creating duplicate edges). This gives the following query:
MATCH (a:Event), (b:Event)
WHERE a.userId = b.userId
AND a.recordId < b.recordId
WITH a, min(b.recordId) AS bRecordId
MATCH (b:Event {recordId: bRecordId})
MERGE (a)-[:NEXT]->(b)
This query creates a Cartesian product for all user ids. As long as the users do not participate in hundreds of events, the size of the Cartesian products should not grow huge. Note that the first MATCH uses both indices (userId and recordId), while the second MATCH uses the index on recordId.
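The "smallest larger recordId" rule is equivalent to sorting each user's events by recordId and linking neighbours. The following Python sketch (not Cypher) shows that logic on invented sample data; the function name next_edges is hypothetical, and it assumes recordId values are unique.

```python
from collections import defaultdict

def next_edges(events):
    """events: iterable of (user_id, record_id) tuples.

    Returns (record_id, next_record_id) pairs, one per event that has
    a later event by the same user.
    """
    by_user = defaultdict(list)
    for user_id, record_id in events:
        by_user[user_id].append(record_id)

    edges = []
    for record_ids in by_user.values():
        record_ids.sort()
        # each id is linked to the smallest larger id of the same user
        edges.extend(zip(record_ids, record_ids[1:]))
    return edges

# Hypothetical log: user 1 has records 1, 3, 4; user 2 has records 2, 5
events = [(1, 1), (2, 2), (1, 3), (2, 5), (1, 4)]
edges = next_edges(events)
```

Each user's events are touched once after sorting, which is why keeping the per-user groups small matters more than the total number of events.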

How to fetch nodes with different labels in one query in Neo4j

I have three types of nodes:
(:Meal), (:User), (:Dish)
With relationships:
(:Meal)<-[:JOIN]-(:User), (:Meal)<-[:ORDERED]-(:Dish)
Now I want to fetch the information of meal in one query. I want to get result like this:
id: 1
name: xxx,
users: [1,2,3,4],
dishes: [23,42,42]
where the users and dishes fields contain the ids of those users and dishes.
I tried:
MATCH (meal:Meal)
OPTIONAL MATCH (meal)<-[:JOIN]-(user:User)
OPTIONAL MATCH (meal)<-[:ORDERED]-(dish:Dish)
RETURN id(meal), meal.name, COLLECT(ID(user)), COLLECT(ID(dish))
However, this query will generate a lot of duplicated users and dishes. If there are N users and M dishes, it will match N*M user-dish pairs.
I do realize that I can use DISTINCT to remove duplication. However, I am not sure about the efficiency of such query.
Is there any better way?
Try to separate the different parts of your query using WITH:
MATCH (meal:Meal)
OPTIONAL MATCH (meal)<-[:JOIN]-(user:User)
WITH meal, collect(ID(user)) as users
OPTIONAL MATCH (meal)<-[:ORDERED]-(dish:Dish)
RETURN id(meal), meal.name, users, COLLECT(ID(dish)) as dishes
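A small Python sketch (not Cypher) shows why collecting the users before matching the dishes removes the duplication. The ids are invented sample data, and the variable names are only illustrative.

```python
from itertools import product

users = [1, 2, 3, 4]
dishes = [23, 42]

# Two chained OPTIONAL MATCH clauses: every user row is expanded by every
# dish, so both collected lists contain N*M entries.
rows = list(product(users, dishes))
collected_users = [u for u, _ in rows]
collected_dishes = [d for _, d in rows]

# Aggregating users first collapses them into a single row per meal;
# the dish match then expands that one row M times.
staged_users = list(users)
staged_rows = [(staged_users, d) for d in dishes]
staged_dishes = [d for _, d in staged_rows]
```

In the staged version each list holds every id exactly once, so no DISTINCT is needed.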
