Neo4j 1.9.5 (using Spring Data Neo4j) query time slow

I have a Neo4j Community Edition 1.9.5 running on an EC2 m1.medium instance (around 4 GB RAM). I have around 300 nodes, 800 relationships, and some 2000 properties. Neo4j is running in REST mode. Below is my applicationContext.xml:
<beans profile="default">
<bean class="org.springframework.data.neo4j.rest.SpringRestGraphDatabase" id="graphDatabaseService">
<constructor-arg index="0" value="http://localhost:7474/db/data/"/>
</bean>
<neo4j:config graphDatabaseService="graphDatabaseService"/>
</beans>
Now, I have the query below, which shows all the movies your friends have liked. It takes ~10 seconds to return!
start user=node(*)
match user-[friend_rela:FRIENDS]-friend,friend-[movie_rela:LIKE]->movie
where has(user.uid) and user.uid={0}
return distinct movie,movie_rela,friend
order by movie_rela.timeStamp desc
skip {1}
limit {2}
The Indexes tab in the admin UI shows I have indexed the following:
Nodes:
movieId (from Movie)
__types__
Movie
uid (from User)
User
Relationships:
IsFriends
Like
__rel_types__
timeStamp
I have also changed the neo4j-wrapper.conf file to have the following heap sizes:
# Initial Java Heap Size (in MB)
wrapper.java.initmemory=512
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=2000
Do you think I am missing anything? I am wondering why it takes so long. Please advise.
Thanks

Your query is horribly inefficient; it basically traverses the full graph multiple times. You should use an index lookup to find the start point for your query and then traverse from the start point(s), so you're doing a local query instead of a global one:
start user=node:User(uid={0})
match (user)-[friend_rela:FRIENDS]-(friend)-[movie_rela:LIKE]->(movie)
return distinct movie,movie_rela,friend
order by movie_rela.timeStamp desc
skip {1} limit {2}
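For example, with concrete values substituted for the parameters (a hypothetical uid and paging values):
start user=node:User(uid="u12345")
match (user)-[friend_rela:FRIENDS]-(friend)-[movie_rela:LIKE]->(movie)
return distinct movie, movie_rela, friend
order by movie_rela.timeStamp desc
skip 0 limit 10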

Related

Best practices for improving full-text search in Neo4j, Apollo Server and GraphQL

I am currently trying to build a search application on top of the SNOMED CT International Full RF2 release. The database is quite huge, and I decided to go with a full-text search index for optimal results. There are primarily 3 types of nodes in a SNOMED CT database:
ObjectConcept
Descriptions
Role Group
There are multiple relationships between the nodes, but I'll focus on those later.
Currently I'm focusing on searching for ObjectConcept nodes by a String property called FSN, which stands for fully specified name. For this I tried two things:
Create text indexes on FSN: with MATCH queries the results were rather slow, even when using the CONTAINS predicate and limiting the return value to 15.
Create full-text indexes: according to the docs, FT indexes are powered by Apache Lucene. I created one for FSN. After using the FT index with AND clauses in the search term, for example:
Search Term : head pain
Query Term: head AND pain
I observe quite impressive gains in query time using the profiler in the Neo4j Browser (from around 43 ms down to 10 ms for some queries). However, once I start querying the DB through the Apollo server, query times go as high as 2-3 s, sometimes up to 8-10 s.
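For example, the kind of full-text query I profile in the Browser looks like this (with a sample search term filled in):
PROFILE CALL db.index.fulltext.queryNodes('searchIndex', 'head AND pain') YIELD node, score
RETURN node, score
ORDER BY score DESC
LIMIT 10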
The query is as follows, implemented by a custom resolver in neo4j/graphql and apollo-server:
const session = context.driver.session();
let words = args.name.split(" ");
let compoundQuery = "";
if (words.length === 1) compoundQuery = words[0];
else compoundQuery = words.join(" AND ");
console.log(compoundQuery);
// also filter by the requested node type
compoundQuery += ` AND (${args.type})`;
return session
  .run(
    `CALL db.index.fulltext.queryNodes('searchIndex', $name) YIELD node, score
     RETURN node
     LIMIT 10`,
    { name: compoundQuery }
  )
  .then((res) => {
    session.close();
    return res.records.map((record) => record.get('node').properties);
  });
I have the following questions:
Am I utilising FT indexes as well as I can, or am I missing important optimisations?
I was trying to use Elasticsearch with Neo4j, but I read that Elasticsearch and the FT indexes are both powered by Lucene. So am I likely to gain improvements from using Elasticsearch? If so, how should I go about it, considering that I am using Neo4j Aura DB and my GraphQL server is on EC2? I am confused as to how to use Elasticsearch overall with the GRANDstack. Any help would be appreciated.
Any other Suggestions for optimising search would be greatly appreciated.
Thanks in advance!

Simple Neo4j query is very slow on large database

I have a Neo4J database with the following properties:
Array Store 8.00 KiB
Logical Log 16 B
Node Store 174.54 MiB
Property Store 477.08 MiB
Relationship Store 3.99 GiB
String Store Size 174.34 MiB
Total Store Size 5.41 GiB
There are 12M nodes and 125M relationships.
So you could say this is a pretty large database.
My OS is Windows 10 64-bit, running on an Intel i7-4500U CPU @ 1.80 GHz with 8 GB of RAM.
This isn't a complete powerhouse, but it's a decent machine and in theory the total store could even fit in RAM.
However, when I run a very simple query (using the Neo4j Browser)
MATCH (n {title:"A clockwork orange"}) RETURN n;
I get a result:
Returned 1 row in 17445 ms.
I also sent a POST request with the same query to http://localhost:7474/db/data/cypher; this took 19 seconds.
Something like this:
http://localhost:7474/db/data/node/15000
is, however, executed in 23 ms...
And I can confirm there is an index on title:
Indexes
ON :Page(title) ONLINE
So, does anyone have ideas on why this might be running so slowly?
Thanks!
This has to scan all nodes in the db - if you re-run your query using n:Page instead of just n, it'll use the index on those nodes and you'll get better results.
To expand this a bit more - INDEX ON :Page(title) is only for nodes with a :Page label, and in order to take advantage of that index your MATCH() needs to specify that label in its search.
If a MATCH() is specified without a label, the query engine has no "clue" what you're looking for, so it has to do a full DB scan to find all the nodes with a title property and check their values.
That's why
MATCH (n {title:"A clockwork orange"}) RETURN n;
is taking so long - it has to scan the entire db.
If you tell the MATCH() you're looking for a node with a :Page label and a title property -
MATCH (n:Page {title:"A clockwork orange"}) RETURN n;
the query engine knows you're looking for nodes with that label and that there's an index on that label it can use, which means it can perform your search with the performance you're looking for.
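To see the difference, you can prefix each query with PROFILE and compare the plans: the unlabeled version starts with an AllNodesScan operator, while the labeled version should use a NodeIndexSeek:
PROFILE MATCH (n:Page {title:"A clockwork orange"}) RETURN n;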

Low performance of neo4j

I am a server engineer at a company that provides a dating service.
Currently I am building a PoC for our new recommendation engine.
I am trying to use Neo4j, but the performance of this database does not meet our needs.
I have a strong feeling that I am doing something wrong and that Neo4j can do much better.
So can someone give me advice on how to improve the performance of my Cypher query, or how to tune Neo4j the right way?
I am using neo4j-enterprise-2.3.1, running on a c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with other users: LIKE, DISLIKE, BLOCK and MATCH.
Each user also has properties like countryCode, birthday and gender.
I imported all our users and relationships from our RDBMS into Neo4j using the neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from the neo4j-import tool said that:
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it's a huge DB :-) In our case some nodes can have up to 30,000 outgoing relationships.
I created 3 indexes in Neo4j:
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users:
[execution plan image]
When I executed this query for a list of users, I got these results:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
EDIT 1: plan with the expanded boxes:
[execution plan image]
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what we got via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.
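As a sketch of that suggestion, the start of the query would drop both the :User label and any index hint on similar, leaving the rest unchanged:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
...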

Neo4j 1.9.9 legacy index very slow after deletes

Using Neo4j 1.9.9.
Some Cypher queries we were running seemed to be unreasonably slow. Some investigation showed that:
Deleting 200k nodes takes about 2-3 seconds on my hardware (a MacBook Pro) when I select them using:
START n=node(*) DELETE n
Adding a WHERE clause does not significantly slow it down
If the nodes are selected using an index, it has similar performance, e.g.
START n=node:__types__(className="com.e2sd.domain.Comment") DELETE n
Except that when repeating the previous test, it is 20x or more slower, with actual times varying from 80 to several hundred seconds. Even more curiously, it doesn't matter whether I repeat the test in the same JVM, start a new program, or clear out all the nodes in the database and verify it has zero nodes. The index-based delete is extremely slow on any subsequent run of the test until I clobber my Neo4j data directory with
rm -R target/neo4j-test/
I'll give some example Scala code here. I'm happy to provide more detail as required.
import org.neo4j.graphdb.Node
import org.neo4j.tooling.GlobalGraphOperations
import scala.collection.JavaConverters._

for (j <- 1 to 3) {
  log("Total nodes in database: " + inNeo4j(""" START n=node(*) RETURN COUNT(n) """).to(classOf[Int]).single)
  log("Start")
  inNeo4j(""" CREATE (x) WITH x FOREACH(i IN RANGE(1, 200000, 1) : CREATE ({__type__: "com.e2sd.domain.Comment"})) """)
  rebuildTypesIndex()
  log("Created lots of nodes")
  val x = inNeo4j(
    """
    START n=node:__types__(className="com.e2sd.domain.Comment")
    DELETE n
    RETURN COUNT(n)
    """).to(classOf[Int]).single
  log("Deleted x nodes: " + x)
}

// log is a convenience method that prints a string and the time since the last log
// inNeo4j is a convenience method to run a Cypher query

def rebuildTypesIndex(): Unit = {
  TransactionUtils.withTransaction(neo4jTemplate) {
    log.info("Rebuilding __types__ index...")
    val index = neo4jTemplate.getGraphDatabase.getIndex[Node]("__types__")
    for (node <- GlobalGraphOperations.at(neo4jTemplate.getGraphDatabaseService).getAllNodes.asScala) {
      index.remove(node)
      if (node.hasProperty("__type__")) {
        val typeProperty = node.getProperty("__type__")
        index.add(node, "className", typeProperty)
      }
    }
    log.info("Done")
  }
}
We are using Neo4j embedded here with the following Spring Data configuration.
<bean id="graphDbFactory" class="org.neo4j.graphdb.factory.GraphDatabaseFactory"/>
<bean id="graphDatabaseService" scope="singleton" destroy-method="shutdown"
factory-bean="graphDbFactory" factory-method="newEmbeddedDatabase">
<constructor-arg value="target/neo4j-test"/>
</bean>
<neo4j:config graphDatabaseService="graphDatabaseService" base-package="my.package.*"/>
Why is the DELETE query slow under the conditions described?
You have to explicitly delete entries from the legacy index; deleting the nodes is not enough to remove them from it. Thus, when you run the test a second time, you have 400k entries in your index, even though half of them point to deleted nodes. That is why your program is slow: repeated runs keep growing the index.
I had this problem when I wrote an extension to Neo4j Spatial to bulk load the R-tree. Using the Java API, I had to explicitly remove entries from the index separately from deleting the nodes. Glad I could help.
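For example, a minimal sketch of that pattern, assuming the question's TransactionUtils and neo4jTemplate helpers (and that the nodes have no remaining relationships):
TransactionUtils.withTransaction(neo4jTemplate) {
  // Delete each node AND its legacy index entry in the same transaction
  val index = neo4jTemplate.getGraphDatabase.getIndex[org.neo4j.graphdb.Node]("__types__")
  val hits = index.get("className", "com.e2sd.domain.Comment")
  while (hits.hasNext) {
    val node = hits.next()
    index.remove(node) // explicitly remove the index entries for this node...
    node.delete()      // ...then delete the node itself
  }
}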

neo4j REST 'Server got itself in trouble'

I am running a very basic test to check my understanding and evaluate the Neo4j REST server (neo4j-community-1.8.M07). I am using the Neo4j Python REST Client.
Each test iteration starts with a random strings for the source node name and the destination node name. The names contain only letters a..z and numbers 0..9 (oddly enough, I never got it to fail if I use A..Z and 0..9). The name may be from one char to 36 chars long and there are no repeating chars. I create 36 nodes, where the 1-st node name is only one char long and the 36-th node name has 36 chars. Then I create relations between all nodes. The name of each relation is the concatenation of the source node name and the destination node name. The final graph has 37 nodes (1 reference node and 36 nodes with names from one char to 36 non-repeating chars) and 1260 relations. Before each test iteration I clear the graph, so that it has only one (the reference) node.
The problem is that after several successful iterations the Neo4j REST server crashes:
Error [500]: Internal Server Error. Server got itself in trouble.
Invalid data sent
The query that crashes the system can be different - here is an example of a query_string that caused a problem:
START n_from=node:index_faqts(node_name="h"),
      n_to=node:index_faqts(node_name="hg2b8wpj04ms")
CREATE UNIQUE n_from-[r:`hhg2b8wpj04ms`]->n_to
RETURN r
self.cypher_extension.execute_query( query_string )
I spent a lot of time trying to find a trend, but in vain. If I had done something wrong with the queries, none of the tests would ever work. I have observed crashes after anywhere between 5 and 25 successful test cycles.
What might be causing neo4j REST server to crash?
P.S. Some details...
The nodes are created like this:
...
self.index_faqts[ "node_name" ][ p_str_node_name ] =
self.gdb.nodes.create( **p_dict_node_attributes )
...
Just in case, before issuing the query to create a new relation, I check the graph to make sure that the source and the destination nodes exist. That check never failed.
You are using too many relationship types; currently the limit is 32k. Since each relationship type here is a unique concatenation of two node names, every iteration adds up to 1260 new types, so repeated runs quickly approach that limit. It might be patched in Neo4j if you have a valid use-case.
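One way to stay under that limit (a sketch reusing the question's example names; the CONNECTS type name is just a placeholder) is to use a single relationship type and store the generated name as a property instead:
START n_from=node:index_faqts(node_name="h"),
      n_to=node:index_faqts(node_name="hg2b8wpj04ms")
CREATE UNIQUE n_from-[r:CONNECTS {name: "hhg2b8wpj04ms"}]->n_to
RETURN r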
