Neo4j modelling advice

I am developing a realtime chat for people selling/buying items.
I would like to know the most performant way to store the messages of a room in Neo4j. I can see two options:
1) add a messages array property to the Room node.
2) make each message a node and link consecutive messages with a "NEXT" relationship.
Which option would be the most performant for Neo4j?
Would just appending a value to the messages array be easier for Neo4j to deal with?

From a performance point of view, the costs of the basic operations in Neo4j are as follows:
Find a node: O(1)
Traverse a relationship: O(1)
If you store all the messages in a single node (option 1), you only have to find one node, so the total cost of the operation is O(1) (constant).
But if you store every message in its own node with a NEXT relationship between consecutive messages (option 2), then to extract N messages you need to find N nodes, so the cost becomes N * 2 * O(1) = O(N) (linear; the factor of 2 is one find plus one traversal per message).
With this in mind, it seems that having all the messages in a single node is better. Of course, the base cost of getting a node with a lot of data in it may be higher than that of getting a smaller node, so to make sure, I'd suggest measuring the time it takes to load a node holding all of the messages, at different sizes, to see how it scales. Then you can decide:
If it scales linearly => both options will have similar performance.
If it doesn't scale linearly:
less than linear => one node with all the messages will be better.
more than linear => a node per message will be better.
I suspect it will be less than linear, but assumptions aren't a good guide, so better to check.
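For concreteness, here is a minimal sketch of option 2 as a linked list with a last-message pointer. The Room.id and Message properties and the LAST_MESSAGE relationship are illustrative names, not anything from the question:
// Append a message to a room's list (illustrative schema, untested sketch).
MATCH (room:Room {id: $roomId})-[old:LAST_MESSAGE]->(prev:Message)
CREATE (msg:Message {text: $text, sentAt: timestamp()})
CREATE (prev)-[:NEXT]->(msg)
CREATE (room)-[:LAST_MESSAGE]->(msg)
DELETE old
// Read the 20 most recent messages by walking NEXT backwards.
MATCH (room:Room {id: $roomId})-[:LAST_MESSAGE]->(last:Message)
MATCH (last)<-[:NEXT*0..19]-(m:Message)
RETURN m.text AS text ORDER BY m.sentAt DESC
Note that the LAST_MESSAGE pointer keeps each append O(1): you never have to scan the whole list to find the tail.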
If you are using Java 8 in your app, one way you can measure the operation time is:
import java.time.Duration;
import java.time.Instant;

Instant start = Instant.now();
// the operation being measured
Instant end = Instant.now();
Duration elapsed = Duration.between(start, end); // e.g. elapsed.toMillis()

Related

How to explain improving speed with larger data in Neo4j Cypher?

I am experimenting with queries over Neo4j data whose size gradually increases. Some of my query expressions behave as I would expect: a larger graph means a longer execution time. But, e.g., the query
MATCH (`g`:Graph)
MATCH (`g`)<-[:`Graph`]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:`Graph`]->(`g`)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons
has these execution times (in ms):
80.2, 208.1, 301.7, 399.23, 0.1, 2.07, 2.61, 2.81, 7.3, 1.5
How to explain the rapid acceleration from the fifth step? The data are the same, just extended; no indexes were created.
The data size on the 4th experiment:
201673 nodes,
601189 relationships,
859225 properties.
The data size on the 5th experiment:
242113 nodes,
741500 relationships,
1047060 properties.
All I can think of is that maybe Cypher starts using some indexes from a certain data size, but I can't find anywhere whether that's the case.
Thank you for any comments.
Neo4j cache management may explain your observations. Could you explain more precisely what you are doing? What version of Neo4j are you using? What is the Graph node? Are you repeatedly running the same query on a graph, then trying again with a larger or smaller graph?
If you are running the same query multiple times on the same data set and seeing more rapid execution times, then the cache may be the reason. In v3.5 and earlier, the cache would "warm up" with repeated execution; putatively this does not occur in v4.x.
You might look at resources on cold start, or general performance tips. You might also look at your transaction log: is it accumulating large files?
Why the backticks around your node identifiers (`g`)? Just use (g:Graph) and [:Graph]; no quoting needed.
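One direct way to check whether the planner starts using an index at a certain data size is to profile the query. PROFILE is standard Cypher tooling; here it is applied to the query from the question:
PROFILE
MATCH (`g`:Graph)
MATCH (`g`)<-[:`Graph`]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:`Graph`]->(`g`)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons
The profiled plan reports db hits per operator and names any index lookups, so comparing the plans at the 4th and 5th data sizes would show whether the plan changed or only the cache warmed up.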

Neo4J using properties on relationships for quicker lookup?

I am currently trying to use Neo4j to perform a complex query (similar to a shortest-path search, except that I have very unusual conditions on the search, such as a minimum path length in terms of nodes traversed).
My dataset contains around 2.5M nodes of one single type, and around 1.5 billion edges (of one single type as well). Each node has, on average, 1000 directed relationships to a "next" node.
So far, I have a query that retrieves this shortest path given all of my conditions, but the only way I have found to get a decent response time (under one second) is to limit the number of results after each new node is added to the path, filter them, order them, and then proceed to the next node (a kind of greedy algorithm, I suppose).
I'd like to limit the results far less than I do in order to yield more paths, but the problem is the exponential complexity of this search: going from LIMIT 40 to LIMIT 60 usually means x10 to x100 the processing time.
That said, I am currently evaluating several solutions to speed up the query, but I'm unsure of the results they will yield, as I don't know how Neo4j really stores my data internally.
The solution I am currently considering is to add a property to my relationships: an integer between 1 and 15, because I will usually query only the relationships that have one, or at most two, specific values for this property (for example, only relationships whose property is 8 or 9).
As far as I can tell, for each relationship Neo4j currently has to fetch the target node's properties and use them to apply my filters, which takes a very long time when crossing paths 4 nodes long with 1000 relationships each (I guess O(1000^4)). Am I right?
With relationship properties, will it have direct access to them without further data fetching? Is there any chance this will make my queries faster? How are Neo4j edge properties stored?
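For concreteness, the kind of relationship-property filter I have in mind would look something like this (the relationship type LINKED_TO and the property name bucket are placeholders, since I haven't settled on names):
MATCH (a)-[r:LINKED_TO]->(b)
WHERE r.bucket IN [8, 9]
RETURN b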
UPDATE
Following @logisima's advice, I wrote a procedure directly against the Java traversal API of Neo4j. I then switched to the raw Java procedure API of Neo4j to get even more power and flexibility, as my use case required it.
The results are really good: the lower-bound complexity is overall a little less than it was before, but the upper bound is about ten times faster, and when at least some of the nodes used in the traversal are in Neo4j's cache, the performance becomes astonishing (depth 20 in less than a second in one of my tests, when I usually only need depth 4).
But that's not all. The procedure is very easy to customise while keeping performance at its best, optimising every single operation. As a result, I can use far more powerful filters in far less computing time, and I can easily update my procedure to add new features. Last but not least, procedures are very easy to plug into spring-data for neo4j (which I use to connect Neo4j to my HTTP API). With Cypher, by contrast, I would have had to auto-generate the queries (they were so complex that doing it properly took about 30 Java classes) and use JDBC for Neo4j, with a separate connection pool just for this request. I cannot recommend the awesome Neo4j Java API enough.
Thanks again @logisima
If you're trying to implement a custom shortest-path algorithm, you should write a Cypher procedure with the traversal API.
The principle of Cypher is pattern matching, whereas you want to traverse the graph in a specific way to find your solution.
The response time should be much faster for your use case!

Single node with properties takes forever to query

I have a 50K-node graph with 10 properties per node. Every node has the same type but different values. Each of the properties is indexed, and I have increased the heap and page cache sizes for the database. Nevertheless, using the browser console, creating the nodes takes 6 minutes!
A query for all the properties also takes a very long time (~2 minutes) to appear in the browser console, yet when the results do appear, the bottom of the browser says that the result of 50K node properties took only 2500 ms.
How do I improve the performance of importing/querying tens of thousands of unique instances of a single node type with 10 properties each and no relationships?
It takes time to update 10 different indexes for each node that you create. Do you really have use cases that require an index for every single property? If not, get rid of the indexes you do not need. Remember, indexes can speed up finding the first node(s) to initiate a query, but they do not help at all when traversing paths through a graph.
If you really need all 10 indexes, then to speed up the importing step, you can: drop all the indexes, import all 50K nodes, and then create each index one at a time (which will take some time for each index). The overall time will be about the same, but the import itself should be much faster.
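A minimal sketch of that drop/import/recreate sequence in 3.x-era Cypher syntax; the :Item label and prop1 property are illustrative stand-ins for the actual node type and properties:
DROP INDEX ON :Item(prop1)
// ...repeat for the other nine indexes, then import the 50K nodes...
CREATE INDEX ON :Item(prop1)
// ...recreate the remaining indexes one at a time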
It takes the neo4j browser a very long time to generate and display the visualization for a very large result (e.g., 10's of thousands of nodes). The browser is not intended for viewing that much data at one time.
1) Check that you are running a recent version of Neo4j. Version 3+ optimised the way that properties are stored and indexed.
2) Check how you're running the query. Maybe your query is not optimised or is problematic in some way. Note in particular that each MATCH generates a 'row': multiple MATCH clauses will yield the Cartesian product of all matched sets, which could be problematic with large amounts of data.
3) Check that each of these properties needs to be attached to a node. Neo4j is optimised for searching for relationships, not for properties.
Consider turning nodes that look like this:
(:Train {
  maxSpeedInKPH: 350,
  fuelType: 'Diesel',
  numberOfEngines: 3
})
into something like
(t:Train),
(t)-[:USES_FUEL_TYPE]->(:Fuel {type: 'Diesel'}),
(t)-[:HAS_MAX_SPEED]->(:MaxSpeed {value: 350, unit: 'km/h'}),
(t)-[:HAS_ENGINE]->(:Engine),
(t)-[:HAS_ENGINE]->(:Engine),
(t)-[:HAS_ENGINE]->(:Engine)
There is generally a benefit to spinning properties out into relationships, even when the values are nearly unique. As a rule of thumb: if a property has a unique value per node, keep it on the node; but if your 50,000 nodes share only, say, 25,000 unique values for a property, it would probably still be beneficial to spin them out into relationships. This is absolutely the case with integer-type properties, where you can also add extra "bucket relationships" to provide a form of indexing. In the example above, the max speed was 350. After spinning the property out into a relationship, you could also add a relationship of the type [:HAS_MAX_SPEED_ABOVE] pointing to a shared bucket node for the value 300. This would complicate your querying, but should make it faster.
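As a sketch, such a bucket could be built with a query along these lines; the :SpeedBucket label is illustrative, not something from the model above:
// Create one shared bucket node and link every qualifying train to it.
MERGE (b:SpeedBucket {value: 300})
WITH b
MATCH (t:Train)-[:HAS_MAX_SPEED]->(s:MaxSpeed)
WHERE s.value > b.value
MERGE (t)-[:HAS_MAX_SPEED_ABOVE]->(b)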
4) If none of the above apply, can be implemented, or help, consider switching to a traditional relational (SQL) database. It would be a perfect candidate for your use case: 50K different nodes (rows) with only 10 different properties (columns) and no relationships (joins).

How much does the architecture of data affect the speed of a query

I have the following nodes and relationships in Neo4j database.
The grey and the pink nodes are further connected with more nodes. Running the following query:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..3]-(z)
RETURN DISTINCT ID(z), z.id, n.id AS InternalID
I get a result very fast (the node n:RealNode is not one of the nodes in the image).
If I increase the depth to 4 like:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..4]-(z)
RETURN DISTINCT ID(z), z.id, n.id AS InternalID
The response becomes extremely slow, and I never get a response at depth 5 or beyond.
Depth 4 is actually the relationship between the blue and pink nodes. So my question is: can the architecture of the data (in this case) affect the speed of the query to such a great extent? If yes, what should I do?
I have tried running the query using parameters as well, but the result was the same. Also, the gid of n:RealNode is an indexed value.
The architecture of your data has a huge, no...massive impact on query performance. There's a lot you can do with improving performance by reformulating your query, but you can do even more than that by changing your data model.
The model needs to be chosen in a way that's an accurate depiction of the real-world domain, but it often also has to make certain concessions to usage patterns. If you know you're going to do certain queries over and over, it makes sense to choose a data model that makes it easy for the DBMS to answer that query. In the RDBMS world, that entire line of thinking gets summarized in the word "denormalization". In graph databases, the concept is the same but the way you go about it is different.
The thing to keep in mind when adjusting your data model is that neo4j is good at traversing relationships fast, and that with all queries, the less data you have to consider, the faster the query will go.
So in your case, I don't know how many nodes branch off of each other node via a :CONTAINS relationship, but I'm guessing that at each level of the hierarchy you have many items below it. Going from level 4 to level 5 probably doesn't just add a fixed number of additional nodes: if, say, each level of the hierarchy has 3x the number of nodes of the level above, then the deeper you go, the more you multiply how much data you have to consider. If it's 10x... then ouch.
You have many different options. One is to create shortcut relationships and "pre-materialize" the query; see the sketch below. Imagine creating :grandfather and :greatgrandfather relationships to "hop" levels of the tree. That would make it faster. Another way would be to filter intermediate nodes, or the returned nodes, so that you're not considering everything but some subset.
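A minimal sketch of pre-materializing such a shortcut over the :CONTAINS hierarchy from the question; the :CONTAINS_2 relationship name is illustrative, and this assumes the hierarchy is directed:
// Run once (or periodically) to maintain two-hop shortcuts.
MATCH (a)-[:CONTAINS*2]->(c)
MERGE (a)-[:CONTAINS_2]->(c)
Queries can then traverse :CONTAINS_2 to cover two levels per hop, halving the effective depth.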
In the end, really huge queries will always take longer than really small ones. You must first begin with a careful understanding of what data you want, and how often you have to run this query. I would not attempt to optimize your data model for infrequently run queries, but if you do this all the time, you should look at your options. Your query to me looks like it's going to return a whole lot of data no matter what you do.

Neo4j get related groups

I'm not sure that title is worded very well, but I'm not sure how else to put it. I'm populating a Neo4j database with some data.
The data is mainly generated from data I have about pairs of users. There is a percentage relationship between users, such as:
  80
A ---> B
But the reverse relation is not the same:
  60
A <--- B
I could put both relations in, but I think what I might do is use the mean:
  70
A <--> B
But I'm open to suggestions here.
What I want to do in Neo4j, is get groups of related users. So for example, I want to get a group of users with mean % relation > 50. So, if we have:
      A
  40 / \ 60
    B---C------D
     20    70
We'd get back a subset, something like:
      A
       \ 60
        C------D
           70
I have no idea how to do that. Another thing is that I'm pretty sure it won't be possible to reach any node from any other node; I think the graph is disconnected, like several large subgraphs. But I'd like to be able to get everything that matches the above, even if some groups of nodes are completely separate from other nodes.
To give an idea of the numbers, there will be around 100,000 nodes and 550,000 edges.
A couple thoughts.
First, it's fine if your graph isn't connected, but you need some way to access every component you want to analyze. In a disconnected graph in Neo4j, that either means Lucene indexing, some external "index" that holds node or relationship ids, or iterating through all nodes or relationships in the DB (which is slow, but might be required by your problem).
Second, though I don't know your domain, realize that sticking with the original representation and weights might be fine. Neo4j doesn't have undirected edges (though you can just ignore edge direction), and you might need that data later. OTOH, your modification taking the mean of the weights does simplify your analysis.
Third, depending on the size and characteristics of your graph, this could be very slow. It sounds like you want all connected components in the subgraph built from the edges with a weight greater than 50. AFAIK, that requires an O(N) operation over your database, where N is the number of edges in the DB. You could iterate over all edges in your DB, filter based on their weight, and then cluster from there.
Using Gremlin/Groovy, you can do this pretty easily. Check out the Gremlin docs.
Another approach might be some sort of iterative clustering as you insert your data. It occurs to me that this could be a significant real-world performance improvement, though it doesn't get around the O(N) necessity.
Maybe something like http://tinyurl.com/c8tbth4 is applicable here?
START a=node(*)
MATCH p=a-[r]-()-[t]-()
WHERE r.percent > 50 AND t.percent > 50
RETURN p, r.percent, t.percent
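For reference, the START clause and bare pattern syntax were removed in later Neo4j versions; an equivalent query in modern Cypher, assuming the same percent property, is:
MATCH p = (a)-[r]-()-[t]-()
WHERE r.percent > 50 AND t.percent > 50
RETURN p, r.percent, t.percent
Note this only returns two-hop paths whose edges both exceed 50; computing full connected components would still need the iterative approach described above.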
