UPDATE: Wes hit a home run here! Thanks. I've added a Rails version I was developing using the Neography gem. It accomplishes the same thing, but his version is much faster; see the comparison below.
I am using a linked list in Neo4j (1.9, REST, Cypher) to help keep the comments in proper order (yes, I know I could just sort on the time, etc.).
(object node)-[:COMMENTS]->(comment)-[:COMMENTS]->(comment)-[:COMMENTS]->(comment) ... etc.
Currently I have 900 comments and it's taking 7 seconds to get through the whole list, which is completely unacceptable. I'm just returning the ID of the node (I know, don't do this, but it's not the point of my post).
What I'm trying to do is find the IDs of users who commented so I can return a count (like "Joe and 405 others commented on your post"). For now I'm not even counting the unique nodes; I'm just returning the author_id for each record. (I'll worry about counting later; first I want to take care of the basic performance issue.)
START object=node(15837) MATCH object-[:COMMENTS*]->comments RETURN comments.author_id
7 seconds is waaaay too long.
Instead of using a linked list, I could simply link all the comments directly to the object node, but that could lead to a supernode that just gets bogged down, and then finding the most recent comments, even with SKIP and LIMIT, would be dog slow.
Will relationship indexes help here? I've never used them other than to ensure a unique relationship or to check whether a relationship exists, but can I use them in a Cypher query to speed things up?
If not, what else can I do to decrease the time it takes to return the IDs?
COMPARISON: Here is the Rails version using "phase II" methods of the Neography gem:
next_node_id = 18233
@neo = Neography::Rest.new
start_node = Neography::Node.load(next_node_id, @neo)
all_nodes = start_node.outgoing(:COMMENTS).depth(10000)
raise all_nodes.size.to_s # quick-and-dirty way to surface the count while debugging
Result: 526 nodes found in 290 ms.
Wes' solution took 5 ms. :-)
Relationship indexes will not help. I'd suggest using an unmanaged extension and the traversal API; it will be a lot faster than Cypher for this particular query on long lists. This example should get you close:
https://github.com/wfreeman/linkedlistlength
I based it on Mark Needham's example here:
http://www.markhneedham.com/blog/2014/07/20/neo4j-2-1-2-finding-where-i-am-in-a-linked-list/
If you're only doing this to return a count, the best solution is not to figure it out on every query, since it isn't changing that often. Cache the result in a total_comments property on the node, and every time a comment relationship is added or removed, update that count. If you want to know whether any of the current user's friends commented on it, so you can say "Joe and 700 others commented on this," you could do a second query:
START joe=node(15830), object=node(15838) MATCH joe-[:FRIENDS]->friend-[:POSTED_COMMENT]->comment<-[:COMMENTS]-object RETURN friend LIMIT 1
You limit it to 1 since you only need the name of one friend who commented. If it returns someone, subtract 1 from the cached count you display and include that friend's name. You could do that with JS so it doesn't delay your page load. Sorry if my Cypher is a little off; I'm not used to the pre-2.0 syntax.
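A minimal sketch of the caching idea, using the node ID from the question (Cypher 1.9-style syntax; assumes total_comments was initialized to 0 when the post node was created):
// Run whenever a comment is attached to the post node
// (decrement analogously when a comment is removed)
START object=node(15837)
SET object.total_comments = object.total_comments + 1
RETURN object.total_comments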
Why is Neo4j ORDER BY very slow for a large database? :(
Here is the example query:
PROFILE MATCH (n:Item) RETURN n ORDER BY n.name DESC LIMIT 25
In the result it reads all records, even though I already have an index on the name property.
The PROFILE output shows that it reads all nodes, which is a real mess for a large number of records.
Is there any solution for this? Or is Neo4j just not a good choice for us? :(
Also, is there any way to get the last record from the nodes?
Your question and problem are not very clear.
1) Are you sure that you added the index correctly?
CREATE INDEX ON :Item(name)
In the Neo4j browser execute :schema to see all your indexes.
2) How many Items does your database hold and what running time are you expecting and achieving?
3) What do you mean by 'last record from nodes'?
Indexes are currently only used to find entry points into the graph, not for other operations such as ordering results.
Index-backed ORDER BY operations have been a highly requested feature for a while, and while we've been tracking them, several other features have taken priority over this work.
I believe index-backed ORDER BY operations are currently scheduled for our 3.5 release, coming in the last few months of 2018.
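To illustrate the distinction in releases before 3.5 (using the label and property from the question):
// The index on :Item(name) is used here: the equality lookup is an entry point into the graph
MATCH (n:Item {name: 'some value'}) RETURN n
// The index is not used for the ordering itself: all Item nodes are scanned and sorted
// before the LIMIT is applied
MATCH (n:Item) RETURN n ORDER BY n.name DESC LIMIT 25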
We have items in our app that form a tree-like structure. You might have a pattern like the following:
(c:card)-[:child]->(subcard:card)-[:child]->(subsubcard:card) ... etc
Every time an operation is performed on a card (at any level), we'd like to record it. Here are some possible events:
The title of a card was updated by Bob
A comment was added by Kate mentioning Joe
The status of a card changed from pending to approved
The linked list approach seems popular but given the sorts of queries we'd like to perform, I'm not sure if it works the best for us.
Here are the main queries we will be running:
All of the activity associated with a particular card AND child cards, sorted by time of the event (basically we'd like to merge all of these activity feeds together)
All of the activity associated with a particular person sorted by time
On top of that we'd like to add filters like the following:
Filter by person involved
Filter by time period
It is also important to note that cards may be re-arranged very frequently. In other words, the parents may change.
Any ideas on how to best model something like this? Thanks!
I have a couple of suggestions, but I would suggest benchmarking them.
The linked list approach might be good if you could use the Java APIs (perhaps via an unmanaged extension for Neo4j). If the newest event in the list were the one attached to the card (so the list is essentially ordered by the date the events happened), then when filtering by time you could terminate the traversal early, as soon as you hit an event earlier than the specified time.
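A rough sketch of that shape, with hypothetical relationship and property names, where the newest event is attached directly to the card and older events are chained behind it:
// Walking NEXT from the card visits events in reverse chronological order
CREATE (c:Card {uuid: 'abc'})
CREATE (e1:Event {at: 100}), (e2:Event {at: 200}), (e3:Event {at: 300})
CREATE (c)-[:LATEST_EVENT]->(e3), (e3)-[:NEXT]->(e2), (e2)-[:NEXT]->(e1)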
Attaching the events directly to the card has the potential to lead to problems with supernodes/dense nodes. It would be the simplest to query for in Cypher, though. The problem is that Cypher will look at all of them before filtering. You could perhaps improve the performance of queries by placing the date/time of the event not only on the event node but also on the relationships to it ((:Card)-[:HAS_EVENT]->(:Event) or (:Event)-[:PERFORMED_BY]->(:Person)). Then when you query, you can filter on the relationships so that Cypher doesn't need to traverse to the nodes.
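For instance (the occurred_at property here is just illustrative, not from the question):
// The filter runs against the relationship property, so event nodes outside
// the time window never have to be loaded
MATCH (c:Card {uuid: 'id_here'})-[r:HAS_EVENT]->(event:Event)
WHERE r.occurred_at > 1400000000
RETURN event
ORDER BY r.occurred_at DESC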
Regardless, it would probably be helpful to break up the query like so:
MATCH (c:Card {uuid: 'id_here'})-[:child*0..]->(child:Card)
WITH child
MATCH (child)-[:HAS_EVENT]->(event:Event)
RETURN event
I think that means the second MATCH will have fewer permutations of paths to evaluate.
Others are welcome to supplement my dubious advice as I've never really dealt with supernodes personally, just read about them ;)
I have identified that some queries return fewer results than expected. I took one of the missing results and tried to force Neo4j to return it, and I succeeded with the following query:
match (q0),(q1),(q2),(q3),(q4),(q5)
where
q0.name='v4' and q1.name='v3' and q2.name='v5' and
q3.name='v1' and q4.name='v3' and q5.name='v0' and
(q1)-->(q0) and (q0)-->(q3) and (q2)-->(q0) and (q4)-->(q0) and
(q5)-->(q4)
return *
I have supposed that the following query is semantically equivalent to the previous one. However in this case, Neo4j returns no result at all.
match (q1)-->(q0), (q0)-->(q3), (q2)-->(q0), (q4)-->(q0), (q5)-->(q4)
where
q0.name='v4' and q1.name='v3' and q2.name='v5' and
q3.name='v1' and q4.name='v3' and q5.name='v0'
return *
I have also manually verified that the required edges among vertices v0, v1, v3, v4, and v5 are present in the database with the right directions.
Am I missing some important difference between these queries or is it just a bug of Neo4j? (I have tested these queries on Neo4j 2.1.6 Community Edition.)
Thank you for any advice
EDIT: Updating to the newest version, 2.2.1, did not help.
This might not be a complete answer, but here's what I found out.
These queries aren't equivalent, if I understand correctly.
First of all, use EXPLAIN (or even PROFILE) to look under the hood at the plans for both queries (the plan screenshots are not reproduced here).
Even without going deep into the plans, those are different queries in terms of both efficiency and semantics.
Next, what's actually going on here:
The first query will look through all (single) nodes, filter them by name, then try to combine them according to your pattern, which involves computing a Cartesian product (hence the enormous space complexity), and then evaluate your other conditions.
The second query will first pick pairs of nodes connected by some relationship (which satisfy the condition on the name property), then bring in the third node and filter again, and so on until the end. The number of candidate rows is expected to decrease after every filter cycle.
By the way, is it possible that you accidentally set the same name twice (q1 and q4 both ask for name='v3')? Within a single MATCH clause, Cypher will not bind the same relationship to two different pattern relationships, so if q1 and q4 end up needing the same edge into q0, the second query drops that match while the first one keeps it.
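A small, hedged illustration of that rule (invented sample data, not from the original database; run each statement separately):
// One node named 'v3' with a single edge to a node named 'v4'
CREATE (a {name: 'v3'})-[:R]->(b {name: 'v4'})
// Patterns as WHERE predicates: q1 and q4 may both bind to the same node, and the
// single relationship satisfies both existence checks, so a row is returned
MATCH (q0), (q1), (q4)
WHERE q0.name = 'v4' AND q1.name = 'v3' AND q4.name = 'v3' AND (q1)-->(q0) AND (q4)-->(q0)
RETURN *
// Patterns inside one MATCH clause: the same relationship cannot be used for both
// pattern parts, so nothing is returned
MATCH (q1)-->(q0), (q4)-->(q0)
WHERE q0.name = 'v4' AND q1.name = 'v3' AND q4.name = 'v3'
RETURN *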
I'm toying around with Neo4j. My data consists of users who own objects, which are tagged with tags. So my schema looks like:
(:User)-[:OWNS]->(:Object)-[:TAGGED_AS]->(:Tag)
I have written a script that generates a sample graph for me. Currently I have 100 User, ~2,500 Tag, and ~10k Object nodes in the database. Between those I have ~700k relationships. I now want to find every Object that is not owned by a certain User but is related via a Tag the User has used himself. The query looks like:
MATCH (user:User {username: 'Cristal'})
WITH user
MATCH (user)-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
However, this query runs for about 1-5 minutes (depending on the user and how many objects he owns), which is more than a bit too slow. What am I doing wrong? I consider this a rather "trivial" query against a graph of modest size. I'm using Neo4j 2.1.6 Community and have already set the Java heap to 2000 MB (and I can see that there is a Java process using that much). Am I missing an index or something like that (I'm new to Neo4j)?
I honestly expected the result to be pretty much instant, especially considering that the Neo4j docs mention I should use a heap between 1 and 4 GB for 100 million objects... and I'm only at about 1/100 of that number.
If it is my query (which I hope and expect), how can I improve it? What do you have to be aware of when writing queries?
Do you have an index on the username property?
CREATE INDEX ON :User(username)
Also you don't really need that WITH there, so maybe drop it to see if it helps:
MATCH (user:User {username: 'Cristal'})-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
Also, I don't think it will make a difference, but you can drop the obj and tag variables since you're not using them elsewhere in the query.
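For instance, the same query with those variables left anonymous:
MATCH (user:User {username: 'Cristal'})-[:OWNS]->(:Object)-[:TAGGED_AS]->(:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20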
Also, if you're generating sample graphs you may want to check out GraphGen:
http://graphgen.neoxygen.io/
There is some code in the project I'm working on where a dynamic finder behaves differently in one code branch than it does in another.
This line of code returns all my advertisers (there are 8 of them), regardless of which branch I'm in.
Advertiser.findAllByOwner(ownerInstance)
But when I start adding conditions, things get weird. In branch A, the following code returns all of my advertisers:
Advertiser.findAllByOwner(ownerInstance, [max: 25])
But in branch B, that code only returns 1 advertiser.
It doesn't seem possible that changes in application code could affect how a dynamic finder works. Am I wrong? Is there anything else that might cause this not to work?
Edit
I've been asked to post my class definitions. Instead of posting all of it, I'm going to post what I think is the important part:
static mapping = {
owner fetch: 'join'
category fetch: 'join'
subcategory fetch: 'join'
}
static fetchMode = [
grades: 'eager',
advertiserKeywords: 'eager',
advertiserConnections: 'eager'
]
This code was present in branch B but absent from branch A. When I pull it out, things now work as expected.
I decided to do some more digging with this code present to see what I could observe. I found something interesting when I used withCriteria instead of the dynamic finder:
Advertiser.withCriteria{owner{idEq(ownerInstance.id)}}
What I found was that this returned thousands of duplicates! So I tried using listDistinct:
Advertiser.createCriteria().listDistinct{owner{idEq(ownerInstance.id)}}
Now this returns all 8 of my advertisers with no duplicates. But what if I try to limit the results?
Advertiser.createCriteria().listDistinct{
owner{idEq(ownerInstance.id)}
maxResults 25
}
Now this returns a single result, just like my dynamic finder does. When I cranked maxResults up to 100K, I got all 8 of my results.
So what's happening? It seems that the joins or the eager fetching (or both) generated SQL that returned thousands of duplicates. Grails dynamic finders must return distinct results by default, so when I wasn't limiting them, I didn't notice anything strange. But once I set a limit, since the records were ordered by ID, the first 25 rows were all duplicates of the same record, meaning that only one distinct record was returned.
As for the joins and eager fetching, I don't know what problem that code was trying to solve, so I can't say whether or not it's necessary; the question is, why does having this code in my class generate so many duplicates?
I found out that the eager fetching was added (many levels deep) in order to speed up the rendering of certain reports, because hundreds of queries were being made. Attempts were made to eager fetch on demand, but other developers had difficulty going more than one level deep using finders or Grails criteria.
So the general answer to the question above is: instead of eager fetching by default, which can cause huge nightmares in other places, we need to find a way to do eager fetching in a single query that can go more than one level down the tree.
The next question is, how? It's not very well supported in Grails, but it can be achieved by simply using Hibernate's Criteria class. Here's the gist of it:
import org.hibernate.FetchMode
import org.hibernate.criterion.CriteriaSpecification
import org.hibernate.criterion.Restrictions

def advertiser = Advertiser.createCriteria()
    .add(Restrictions.eq('id', advertiserId))
    .createCriteria('advertiserConnections', CriteriaSpecification.INNER_JOIN)
    .setFetchMode('serpEntries', FetchMode.JOIN)
    .uniqueResult()
Now the advertiser's advertiserConnections will be eagerly fetched, and the advertiserConnections' serpEntries will also be eagerly fetched. You can go as far down the tree as you need to. Then you can leave your classes lazy by default, which they definitely should be for hasMany scenarios.
Since your query is retrieving duplicates, there's a chance that the limit of 25 records returns the same data, so your distinct reduces them to one record.
Try defining equals() and hashCode() in your classes, especially if any of them have a composite primary key or are used in a hasMany.
I also suggest trying to eliminate the possibilities: remove the fetch and eager settings one by one to see how they affect your result data (without the limit).