Simple Neo4j query is very slow on large database - neo4j

I have a Neo4J database with the following properties:
Array Store 8.00 KiB
Logical Log 16 B
Node Store 174.54 MiB
Property Store 477.08 MiB
Relationship Store 3.99 GiB
String Store Size 174.34 MiB
MiB Total Store Size 5.41 GiB
There are 12M nodes and 125M relationships.
So you could say this is a pretty large database.
My OS is windows 10 64bit, running on an Intel i7-4500U CPU #1.80Ghz with 8GB of RAM.
This isn't a complete powerhouse, but it's a decent machine and in theory the total store could even fit in RAM.
However when I run a very simple query (using the Neo4j Browser)
MATCH (n {title:"A clockwork orange"}) RETURN n;
I get a result:
Returned 1 row in 17445 ms.
I also used a post request with the same query to http://localhost:7474/db/data/cypher, this took 19seconds.
something like this:
http://localhost:7474/db/data/node/15000
is however executed in 23ms...
And I can confirm there is an index on title:
Indexes
ON :Page(title) ONLINE
So anyone have ideas on why this might be running so slow?
Thanks!

This has to scan all nodes in the db - if you re-run your query using n:Page instead of just n, it'll use the index on those nodes and you'll get better results.
To expand this a bit more - INDEX ON :Page(title) is only for nodes with a :Page label, and in order to take advantage of that index your MATCH() needs to specify that label in its search.
If a MATCH() is specified without a label, the query engine has no "clue" what you're looking for so it has to do a full db scan in order to find all the nodes with a title property and check its value.
That's why
MATCH (n {title:"A clockwork orange"}) RETURN n;
is taking so long - it has to scan the entire db.
If you tell the MATCH() you're looking for a node with a :Page label and a title property -
MATCH (n:Page {title:"A clockwork orange"}) RETURN n;
the query engine knows you're looking for nodes with that label, it also knows that there's an index on that label it can use - which means it can perform your search with the performance you're looking for.

Related

Neo4J Very Large Admin Import with limited RAM

I am importing several TB of CSV data into Neo4J for a project I have been working on. I have enough fast storage for the estimated 6.6TiB, however the machine has only 32GB of memory, and the import tool is suggesting 203GB to complete the import.
When I run the import, I see the following (I assume it exited because it ran out of memory). Is there any way I can import this large dataset with the limited amount of memory I have? Or if not with the limited amount of memory I have, with the maximum ~128GB that the motherboard this machine can support.
Available resources:
Total machine memory: 30.73GiB
Free machine memory: 14.92GiB
Max heap memory : 6.828GiB
Processors: 16
Configured max memory: 21.51GiB
High-IO: true
WARNING: estimated number of nodes 37583174424 may exceed capacity 34359738367 of selected record format
WARNING: 14.62GiB memory may not be sufficient to complete this import. Suggested memory distribution is:
heap size: 5.026GiB
minimum free and available memory excluding heap size: 202.6GiB
Import starting 2022-10-08 19:01:43.942+0000
Estimated number of nodes: 15.14 G
Estimated number of node properties: 97.72 G
Estimated number of relationships: 37.58 G
Estimated number of relationship properties: 0.00
Estimated disk space usage: 6.598TiB
Estimated required memory usage: 202.6GiB
(1/4) Node import 2022-10-08 19:01:43.953+0000
Estimated number of nodes: 15.14 G
Estimated disk space usage: 5.436TiB
Estimated required memory usage: 202.6GiB
.......... .......... .......... .......... .......... 5% ∆1h 38m 2s 867ms
neo4j#79d2b0538617:~/import$
TL:DR; Using Periodic Commit, or Transaction Batching
If you're trying to follow the Operations Manual: Neo4j Admin Import, and your csv matches the movies.csv in that example, I would suggest instead doing a more manual USING PERIODIC COMMIT LOAD CSV...:
Stop the db.
Put your csv at neo4j/import/myfile.csv.
If you're using Desktop: Project > DB > click the ... on the right >
Open Folder
Add the APOC plugin.
Start the DB.
Next, open a browser instance, run the following (adjust for your data), and leave it until tomorrow:
USING PERIODIC COMMIT LOAD CSV FROM 'file:///myfile.csv' AS line
WITH line[3] AS nodeLabels, {
id: line[0],
title: line[1],
year: toInteger(line[2])
} AS nodeProps
apoc.create.node(SPLIT(line[3],';',
Note: There are many ways to solve this problem, depending on your source data and the model you wish to create. This solution is only meant to give you a handful of tools to help you get around the memory limit. If it is a simple CSV, and you don't care about what labels the nodes get initially, and you have headers, you can skip the complex APOC, and probably just do something like the following:
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///myfile.csv' AS line
CREATE (a :ImportedNode)
SET a = line
File for Each Label
Original Asker mentioned having a separate csv for each label. In such instances it may be helpful to have a great-big single-command that can handle all of it, rather than needing to manually step through each step of the operation.
Assuming two label-types, each with a unique 'id' property, and one with a 'parent_id' referencing the other label...
UNWIND [
{ file: 'country.csv', label: 'Country'},
{ file: 'city.csv', label: 'City'}
] AS importFile
USING PERIODIC COMMIT LOAD CSV FROM 'file:///' + importFile.file AS line
CALL apoc.merge.node([importFile.label], {id: line.id}) YIELD node
SET node = line
;
// then build the relationships
MATCH (city :City)
WHERE city.parent_id
MATCH (country :Country {id: city.parent_id)
MERGE (city)-[:IN]->(country)

neo4j index failed immediately without error

NOTE this is not the same question Neo4j Index creation fails
We have 2 types of nodes in our DB which both have a string property called text. For node A this property is not unique and for node B this property is unique. (we never explicitly have any cypher code that creates any boundary/unique conditions on anything)
We want to have 2 indexes:
create index on :A(text)
create index on :B(text)
(i am currently running those cypher queries seperately from the browser interface)
the index on :B(text) takes a minute and works fine (4 million nodes), but the index on :A(text) (1 million nodes) fails immediately and eventhough i have also tried it with the python driver with the lowest debug level and looked through /var/log/neo4j/.. i can't find any indication as to why this is failing. Judging by the documentation the fact that :A(text) is not unique should not be a problem. Does anyone have hunch where something could be wrong?
EDIT:
Neo4j Version: 3.4.8
Longest :A(text) string: 17 chars (individual words)
Longest :B(text) string: 53 chars (short phrases)
EDIT: I realize fails immediately can be a bit of a vague term, so here exactly what happens when using the browser interface...
create index on :B(text)
returns
Added 1 index, completed after 1 ms.
immediately after
:schema
returns
Indexes
ON :B(text) POPULATING
after a while
:schema
returns
Indexes
ON :B(text) ONLINE
now for A
create index on :A(text)
returns
Added 1 index, completed after 1 ms.
immediately after
:schema
returns
Indexes
ON :B(text) ONLINE
ON :A(text) FAILED

Low performance of neo4j

I am server engineer in company that provide dating service.
Currently I am building a PoC for our new recommendation engine.
I try to use neo4j. But performance of this database does not meet our needs.
I have strong feeling that I am doing something wrong and neo4j can do much better.
So can someone give me an advice how to improve performance of my Cypher’s query or how to tune neo4j in right way?
I am using neo4j-enterprise-2.3.1 which is running on c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with others users - LIKE, DISLIKE, BLOCK and MATCH.
Also he has a properties like countryCode, birthday and gender.
I made import of all our users and relationships from RDBMS to neo4j using neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from neo4j-import tool said that :
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it’s huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships..
I made 3 indexes in neo4j :
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I try to build online recommendation engine using this query :
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
here is the execution plan for one of the user :
plan
When I executed this query for list of users I had the result :
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest is too slow for Real-time recommendations..
Can you tell me what I am doing wrong?
Thanks.
EDIT 1 : plan with the expanded boxes :
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what it gotten via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.

Neo4J Write: CPU at 100% when there is a lot of write

I ran around 3k write queries in around 1 minutes, the CPU hits 100%.
Here is the jstack log:
jstack when CPU at 100%.
Can anyone tell me what is going on from the jstack logs,so that I can optimize my writes?
I am using Node.js Neo4J client(runs on m3.xlarge AWS instance) to write my changes.
Thank you.
Your trace looks ok, it is just a few threads busy reading things.
It could be garbage collection induced CPU spikes or something else that's not visible in the stacks.
Can you share the (type of) statements you run?
For your queries:
only merge on one label
make sure to have an index / constraint for each :Label(property) that you merge or match on
if you match on a property always have a :Label and an index for it:
you might also want to add a generic :Node label if you are working with generic guids all the time
create index on :Node(guid);
create index on :Book(id);
'MERGE (u:Node{guid:{guid}})',
'SET u.name={name}, u:Book'
'MERGE (u:Node {guid:{guid}})',
'SET u.name={name}, u.sub_type={sub_type}, u:Home:Area'
// are you sure you mean :Book(id) not :Book(guid) ?
'MATCH ( e:Node {guid:{guid}} ), (m:Book{id:{id}})',
'MERGE (e)<-[r:MEMBER]-(m)',
'return r'

Neo4j: Best way to batch relate nodes using Cypher?

When I run a script that tries to batch merge all nodes a certain types, I am getting some weird performance results.
When merging 2 collections of nodes (~42k) and (~26k), the performance is nice and fast.
But when I merge (~42) and (5), performance DRAMATICALLY degrades. I'm batching the ParentNodes (so (~42k) split up in batches of 500. Why does performance drop when I'm, essentially, merging less nodes (when the batch set is the same, but the source of the batch set is high and the target set is low)?
Relation Query:
MATCH (s:ContactPlayer)
WHERE has(s.ContactPrefixTypeId)
WITH collect(s) AS allP
WITH allP[7000..7500] as rangedP
FOREACH (parent in rangedP |
MERGE (child:ContactPrefixType
{ContactPrefixTypeId:parent.ContactPrefixTypeId}
)
MERGE (child)-[r:CONTACTPLAYER]->(parent)
SET r.ContactPlayerId = parent.ContactPlayerId ,
r.ContactPrefixTypeId = child.ContactPrefixTypeId )
Performance Results:
Process Starting
Starting to insert Contact items
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++]
Total time for 42149 Contact items: 19176.87ms
Average time per batch (500): 213.4ms
Longest batch time: 663ms
Starting to insert ContactPlayer items
[++++++++++++++++++++++++++++++++++++++++++++++++++++++++]
Total time for 27970 ContactPlayer items: 9419.2106ms
Average time per batch (500): 167.75ms
Longest batch time: 689ms
Starting to relate Contact to ContactPlayer
[++++++++++++++++++++++++++++++++++++++++++++++++++++++++]
Total time taken to relate Contact to ContactPlayer: 7907.4877ms
Average time per batch (500): 141.151517857143ms
Longest batch time: 883.0918ms for Batch number: 0
Starting to insert ContactPrefixType items
[+]
Total time for 5 ContactPrefixType items: 22.0737ms
Average time per batch (500): 22ms
Longest batch time: 22ms
Already inserted data for Contact.
Starting to relate ContactPrefixType to Contact
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++]
Total time taken to relate ContactPrefixType to Contact: 376540.8309ms
Average time per batch (500): 4429.78643647059ms
Longest batch time: 14263.1843ms for Batch number: 63
So far, the best I could come up with is the following (and it's a hack, specific to my environment):
If / Else condition:
If childrenNodes.count() < 200 -> assume they are type identifiers for the parent... i.e. ContactPrefixType
Else assume it is a matrix for relating multiple item types together (i.e. ContactAddress)
If childNodes < 200
MATCH (parent:{parentLabel}),
(child:{childLabel} {{childLabelIdProperty}:parent.{parentRelationProperty}})
CREATE child-[r:{relationshipLabel}]->parent
This takes about 3-5 seconds to complete per relationship type
Else
MATCH (child:{childLabel}),
(parent:{parentLabel} {{parentPropertyField : child.{childLabelIdProperty}})
WITH collect(parent) as parentCollection, child
WITH parentCollection[{batchStart}..{batchEnd}] as coll, child
FOREACH (parent in coll |
CREATE child-[r:{relationshipLabel}]-parent )
I'm not sure this is the most efficient way of doing this, but after trying MANY different options, this seems to be the fastest.
Stats:
insert 225,018 nodes with 2,070,977 properties
create 464,606 relationships
Total: 331 seconds.
Because this is a straight import and I'm not dealing with updates yet, I assume that all the relationships are correct and don't need to worry about invalid data... however, I will try to set properties to the relationship type so as to be able to perform cleanup functions later (i.e. store the parent and child Id's in the relationship type as properties for later reference)
If anyone can improve on this, I would love it.
Can you pass the ids in as parameters rather than fetch them from the graph? The query could look like
MATCH (s:ContactPlayer {ContactPrefixTypeId:{cptid})
MERGE (c:ContactPrefixType {ContactPrefixTypeId:{cptid})
MERGE c-[:CONTACT_PLAYER]->s
If you use the REST API Cypher resource, I think the entity should look something like
{
"query":...,
"params": {
"cptid":id1
}
}
If you use the transactional endpoint, it should look something like this. You control transaction size by the number of statements in each call, and also by the number of calls before you commit. More here.
{
"statements":[
"statement":...,
"parameters": {
"cptid":id1
},
"statement":...,
"parameters": {
"cptid":id2
}
]
}

Resources