Neo4j get related groups

I'm not sure that title is worded very well, but I'm not sure how else to phrase it. I'm populating a Neo4j database with some data.
The data is mainly generated from data I have about pairs of users. There is a percentage relationship between users, such as:
   80
A ---> B
But the reverse relation is not the same:
   60
A <--- B
I could put both relations in, but I think what I might do, is use the mean:
   70
A <--> B
But I'm open to suggestions here.
What I want to do in Neo4j is get groups of related users. For example, I want to get a group of users with a mean % relation > 50. So, if we have:
       A
  40  / \  60
     B --- C ------ D
       20     70
We'd get back a subset, something like:
       A
        \  60
         C ------ D
            70
I have no idea how to do that. Another thing: I'm pretty sure the graph isn't fully connected, so you can't reach every node from every other node; it's more like several large disconnected subgraphs. But I'd like to get everything that matches the above, even if some groups of nodes are completely separate from other nodes.
As an idea of the numbers involved: there will be around 100,000 nodes and 550,000 edges.

A couple thoughts.
First, it's fine if your graph isn't connected, but you need some way to access every component you want to analyze. In a disconnected graph in Neo4j, that means either Lucene indexing, some external "index" that holds node or relationship ids, or iterating over all nodes or relationships in the DB (which is slow, but might be required by your problem).
Second, though I don't know your domain, realize that sticking with the original representation and weights might be fine. Neo doesn't have undirected edges (though you can just ignore edge direction), and you might need that data later. OTOH, your modification taking the mean of the weights does simplify your analysis.
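If you do collapse each pair of directed edges into one edge with the mean weight, a minimal Cypher sketch might look like this (the RELATED and RELATED_MEAN relationship types and the percent property are assumptions, not from the question):
// Replace each directed pair with a single undirected-style edge holding the mean
MATCH (a)-[r1:RELATED]->(b), (b)-[r2:RELATED]->(a)
WHERE id(a) < id(b)
CREATE (a)-[:RELATED_MEAN {percent: (r1.percent + r2.percent) / 2.0}]->(b);
The id(a) < id(b) guard ensures each pair is processed only once.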
Third, depending on the size and characteristics of your graph, this could be very slow. It sounds like you want all connected components in the subgraph built from the edges with a weight greater than 50. AFAIK, that requires an O(N) operation over your database, where N is the number of edges in the DB. You could iterate over all edges in your DB, filter based on their weight, and then cluster from there.
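The filtering step itself is straightforward in Cypher; a minimal sketch, again assuming a RELATED type with a percent property (the clustering into components would then happen in your application code):
// Every edge of the >50 subgraph, each pair reported once
MATCH (a)-[r:RELATED]-(b)
WHERE r.percent > 50 AND id(a) < id(b)
RETURN id(a), id(b), r.percent;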
Using Gremlin/Groovy, you can do this pretty easily. Check out the Gremlin docs.
Another approach might be some sort of iterative clustering as you insert your data. It occurs to me that this could be a significant real-world performance improvement, though it doesn't get around the O(N) necessity.

Maybe something like http://tinyurl.com/c8tbth4 is applicable here?
START a=node(*)
MATCH p=a-[r]-()-[t]-()
WHERE r.percent > 50 AND t.percent > 50
RETURN p, r.percent, t.percent

Related

Neo4j modelling advice

I am developing a realtime chat for people selling/buying items.
I would like to know the most performant way to store the messages of a room in Neo4j. I can see two options:
1) Add a messages array property to the Room node.
2) Make each message a node and link consecutive messages with a NEXT relationship.
Which option would be the most performant for Neo4j?
Would just appending a value to the messages array be easier for Neo4j to deal with?
From a performance point of view, the costs of the operations in Neo4j are as follows:
Find a node: O(1)
Traverse a relationship: O(1)
If you store all the messages in a single node, you only have to find one node, so the total cost of the operation is O(1) (constant).
But if you store every message in its own node with a NEXT relationship between each message, then to extract N messages you need to find N nodes, so the cost becomes N * 2 * O(1) = O(N) (linear; the 2 is because each message costs one find plus one traversal).
So with this in mind, it seems that having all the messages in a single node is better. Of course, the base cost of loading a node with a lot of data in it might be higher than loading a smaller node, so to be sure, I'd suggest measuring the time it takes to load a node with all of the messages in it at different sizes, to see how it scales. Then you can decide:
If it scales linearly => both will have similar performance
If it doesn't scale linearly:
less than linear => one node with all the messages will be better
more than linear => a node per message will be better
I suspect it will be less than linear, but assumptions aren't a good guide, so it's better to check.
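For concreteness, here is a hedged sketch of what the two options might look like in Cypher (the Room and Message labels and the LAST_MESSAGE pointer are hypothetical, not from the question):
// Option 1: messages as an array property on the Room node
MATCH (r:Room {name: 'general'})
SET r.messages = coalesce(r.messages, []) + 'hello';
// Option 2: each message is its own node, chained with NEXT
MATCH (r:Room {name: 'general'})-[:LAST_MESSAGE]->(last:Message)
CREATE (m:Message {text: 'hello'})
CREATE (last)-[:NEXT]->(m);
Option 2 needs some pointer like LAST_MESSAGE (or a lookup of the chain tail) so appends don't require walking the whole chain.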
If you are using Java 8 in your app, one way you can measure the operation time is:
import java.time.Duration;
import java.time.Instant;

Instant start = Instant.now();
// ... run the operation being measured ...
Instant end = Instant.now();
Duration elapsed = Duration.between(start, end);

Neo4j graph modelling performance and querability, property to a node or as separate node plus relationship

I am teaching myself graph modelling and use the Neo4j 2.2.3 database with NodeJS and the Express framework.
I have skimmed through the free Neo4j graph database book and learned how to model a scenario, when to use relationships and when to create nodes, etc.
I have modelled a vehicle selling scenario, with the following structure:
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the screenshot above (136 & 137 are VEHICLE nodes), the majority of a vehicle's features are created as separate nodes and shared among vehicles with a common feature via relationships.
Could you please advise whether features like color, body type, driving side (left or right drive), gearbox and others should be separate nodes or properties of the VEHICLE node? Which option is more performance-friendly and easier to query?
I want to write a JS code that allows querying the graph with above structure with one or many search criteria. If majority of those features are properties of VEHICLE node then querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However, with the existing graph model that I have, it is tricky to search, especially when there are multiple criteria that are not properties of the VEHICLE node but separate nodes linked via relationships; see the sketch below.
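A hedged sketch of such a multi-criteria search against the relationship types above (x and y stand in for concrete price bounds, as in the earlier query):
MATCH (v:VEHICLE)-[:VGEARBOX_IS]->(:VGEARBOX {type: 'MANUAL'}),
      (v)-[:VCONSUMES_FUEL_TYPE]->(:VFUEL_TYPE {type: 'PETROL'})
WHERE v.price > x AND v.price < y
RETURN v;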
Any ideas and advice regarding the existing structure of the graph, to make it more query-able as well as performance-friendly, would be much appreciated. If we imagine a scenario with 1,000 VEHICLE nodes, that would generate 15,000 relationships, which sounds a bit scary, and if it hits a million VEHICLE nodes, at most 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs; it looks like you have a decent start.
Don't be concerned at all with the number of relationships. That's what graph databases are good at, so I wouldn't be too concerned about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's slower to scan all nodes of a label and find the items where a property equals a value (see the sketch after these considerations).
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox": there are manuals and other types. If it's a property value, you won't easily be able to later store four other sub-types/sub-aspects of "gearbox". If it were a node, that would be easy, because you could add more properties to the node or relate other things to it.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.
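To make the first consideration concrete, here's a hedged sketch contrasting the two access patterns (the gearbox property name in the first variant is an assumption):
// Property variant: scans all VEHICLE nodes and filters on a property
MATCH (v:VEHICLE) WHERE v.gearbox = 'MANUAL' RETURN v;
// Node variant: finds the single gearbox node, then follows relationships
MATCH (:VGEARBOX {type: 'MANUAL'})<-[:VGEARBOX_IS]-(v:VEHICLE) RETURN v;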

Seeking Neo4J Cypher query for long but (nearly) unique paths

We have a Neo4J database representing an evolutionary process, with about 100K nodes and 200K relationships. Nodes are individuals in generations, and edges represent parent-child relationships. The primary goal is to be able to take one or more nodes of interest in the final generation and explore their evolutionary history (roughly, "how did we get here?").
The "obvious" first query to find all their ancestors doesn't work because there are just too many possible ancestors and paths through that space:
match (a)-[:PARENT_OF*]->(c {is_interesting: true})
return distinct a;
So we've pre-processed the data so that some edges are marked as "special" such that almost every node has at most one "special" parent edge, although occasionally both parent edges are marked as "special". My hope, then, was that this query would (efficiently) generate the (nearly) unique path along "special" edges:
match (a)-[r:PARENT_OF* {special: true}]->(c {is_interesting: true})
return distinct a;
This, however, is still unworkably slow.
This is frustrating because "as a human", the logic is simple: Start from the small number of "interesting" nodes (often 1, never more than a few dozen), and chase back along the almost always unique "special" edges. Assuming a very low number of nodes with two "special" parents, this should be something like O(N) where N is the number of generations back in time.
In Neo4J, however, going back 25 steps from a unique "interesting" node where every step is unique takes 30 seconds, and once there's a single bifurcation (where both parents are "special") it gets worse much faster as a function of steps. 28 steps (which gets us to the first bifurcation) takes 2 minutes, 30 steps (where there's still only the one bifurcation) takes 6 minutes, and I haven't even thought to try the full 100 steps to the beginning of the simulation.
Some similar work last year seemed to perform better, but we used a variety of edge labels (e.g., (a)-[:SPECIAL_PARENT_OF*]->(c) as well as (a)-[:PARENT_OF*]->(c)) instead of using data fields on the edges. Is querying on relationship field values just not a good idea? We have quite a few different values attached to a relationship in this model (some boolean, some numeric) and we were hoping/assuming we could use those to efficiently limit searches, but maybe that wasn't really the case.
Suggestions for how to tune our model or queries would be greatly appreciated.
Update: I should have mentioned that this is all with Neo4J 2.1.7. I'm going to give 2.2 a try as per Brian Underwood's suggestion and will report back.
I've had some luck with specifying a limit on the path length. So if you know that it's never more than 30 hops you might try:
MATCH (c {is_interesting: true})
WITH c
MATCH (a)-[:PARENT_OF*1..30]->(c)
RETURN DISTINCT a
Also, is there an index on the is_interesting property? Not having one could also cause slowness, for sure.
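A minimal sketch of adding one, assuming the nodes carry a hypothetical :Individual label (schema indexes in Neo4j 2.x are per-label, so the query would also need to match on that label):
CREATE INDEX ON :Individual(is_interesting);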
What version of Neo4j are you using? If you are on 2.2.0, or if you upgrade to it, you get to use the new query profiling tools:
http://neo4j.com/docs/2.2.0/how-do-i-profile-a-query.html
Also if you use them in the web console you get a nice graph-ish tree thing (technical term) showing each step.
After exploring things with the profiling tools in Neo4J 2.2 (thanks to Brian Underwood for the tip) it's pretty clear that (at the moment) Neo4J doesn't do any pre-filtering on edge properties, which leads to nasty combinatorial explosions with long paths.
For example the original query:
match (a)-[r:PARENT_OF* {special: true}]->(c {is_interesting: true})
return distinct a;
finds all the paths from a to c and then eliminates the ones that have edges that aren't special. Since there are many millions of paths from a to c, this is totally infeasible.
If I instead add an IS_SPECIAL edge wherever there was a PARENT_OF edge that had {special: true}, then the queries become really fast, allowing me to push back around 100 generations in under a second.
This query creates all the new edges:
match (a)-[r:PARENT_OF {special: true}]->(b)
create (a)-[:IS_SPECIAL]->(b);
and takes under a second to add 91K relationships in our graph.
Then
match (c {is_interesting: true})
with c
match (a)-[:IS_SPECIAL*]->(c)
return distinct a;
takes under a second to find the 112 nodes along the "special" path back from a unique target node c. Matching c first and limiting the set of nodes using WITH c also seems to be important, as Neo4J doesn't appear to pre-filter on node properties either, and if there are several "interesting" target nodes, things get a lot slower.

How much does the architecture of data affect the speed of a query

I have the following nodes and relationships in Neo4j database.
The grey and the pink nodes are further connected to more nodes. Running the following query:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..3]-(z)
RETURN DISTINCT ID(z), z.id, n.id AS InternalID
I get a result very fast (the node n:RealNode is not one of the nodes in the image).
If I increase the depth to 4 like:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..4]-(z)
RETURN DISTINCT ID(z), z.id, n.id AS InternalID
the response gets extremely slow, and I never get a response at all with depth 5 or more.
Depth 4 is actually where the relationship between the blue and pink nodes sits. So my question is: can the architecture of the data (in this case) affect the speed of the query to such a great extent? If yes, what should I do?
I have also tried running the query using parameters, but the result was the same. Also, the gid of n:RealNode is an indexed value.
The architecture of your data has a huge, no... massive impact on query performance. There's a lot you can do to improve performance by reformulating your query, but you can do even more than that by changing your data model.
The model needs to be chosen in a way that's an accurate depiction of the real-world domain, but it often also has to make certain concessions to usage patterns. If you know you're going to do certain queries over and over, it makes sense to choose a data model that makes it easy for the DBMS to answer that query. In the RDBMS world, that entire line of thinking gets summarized in the word "denormalization". In graph databases, the concept is the same but the way you go about it is different.
The thing to keep in mind when adjusting your data model is that neo4j is good at traversing relationships fast, and that with all queries, the less data you have to consider, the faster the query will go.
So in your case, I don't know how many nodes branch off of each node via a :CONTAINS relationship, but I'm guessing that at each level of the hierarchy you have many items below it. Going from depth 4 to depth 5 doesn't just add a fixed number of additional nodes: if, say, each level of the hierarchy has 3x the number of nodes as the level above, then every extra level multiplies the data you have to consider (on the order of 3^4 = 81 paths at depth 4 versus 3^5 = 243 at depth 5). If it's 10x... then ouch.
You have many different options. One is to create shortcut relationships and "pre-materialize" the query; imagine creating :grandfather and :greatgrandfather relationships to "hop" levels of the tree. That would make it faster. Another way would be to filter intermediate nodes, or the returned nodes, so that you're not considering everything, but some subset.
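A minimal sketch of that pre-materialization idea, assuming a hypothetical :CONTAINS_2 shortcut type for two-hop :CONTAINS paths:
// Add a shortcut edge for every two-hop CONTAINS path
MATCH (a)-[:CONTAINS]->()-[:CONTAINS]->(c)
CREATE (a)-[:CONTAINS_2]->(c);
A depth-4 traversal could then be expressed as two :CONTAINS_2 hops instead of four :CONTAINS hops.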
In the end, really huge queries will always take longer than really small ones. You must first begin with a careful understanding of what data you want, and how often you have to run this query. I would not attempt to optimize your data model for infrequently run queries, but if you do this all the time, you should look at your options. Your query to me looks like it's going to return a whole lot of data no matter what you do.

Getting a "slice" of a linked-list in neo4j

So, I have a structure that resembles a linked list. Each node has a prev field holding the id of the previous node, and I link them together using a chain relationship. There are some cases where a node is not part of this chain, i.e., its prev points to another node, but nothing points to it, or only one node points to it.
I want to take a "slice" of this list, including only the nodes that are directly linked, i.e., starting from node A and going back to node B, return all the nodes in between.
This is what I have so far
match (fb {id: A}) - [:chain] -> (eb {id:B})
return fb
However, it returns no results... I think I need it to go recursive in some way, but I'm not sure how to indicate that. I've tried using :chain*, but this tends to process forever. I think I need a way to limit it...
How do I do this?
What about this?
MATCH (fb {id: A})-[:chain*1..10]->(eb {id:B})
RETURN fb
That should limit it to 10 levels. You can change that if you like, obviously, but it affects performance.
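Since the goal is the nodes in between, not just fb, a hedged follow-up sketch that returns every node on the matched path via nodes():
MATCH p = (fb {id: A})-[:chain*1..10]->(eb {id: B})
RETURN nodes(p)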
EDIT: Was just reading this guide to performance tuning:
http://neo4j.com/developer/guide-performance-tuning/
One bit that caught my eye:
If you’re using queries that will have a relatively large working set (ie. will be traversing long paths, looking at lots of properties, or collecting large sets of results in order to do sorting, etc) then you’ll need a larger working heap. If you have small queries that do very limited traversals and return small amounts of data, you need less. Assume 1-2GB to start and tune from there
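For Neo4j 2.x, the JVM heap is set in conf/neo4j-wrapper.conf; a hedged example following that 1-2GB starting point (values are in MB):
# conf/neo4j-wrapper.conf
wrapper.java.initmemory=2048
wrapper.java.maxmemory=2048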
