Most efficient way to get all connected nodes in neo4j - neo4j

The answer to this question shows how to get a list of all nodes connected to a particular node via a path of known relationship types.
As a follow up to that question, I'm trying to determine if traversing the graph like this is the most efficient way to get all nodes connected to a particular node via any path.
My scenario: I have a tree of groups (group can have any number of children). This I model with IS_PARENT_OF relationships. Groups can also relate to any other groups via a special relationship called role playing. This I model with PLAYS_ROLE_IN relationships.
The most common question I want to ask is MATCH(n {name: "xxx") -[*]-> (o) RETURN o.name, but this seems to be extremely slow on even a small number of nodes (4000 nodes - takes 5s to return an answer). Note that the graph may contain cycles (n-IS_PARENT_OF->o, n<-PLAYS_ROLE_IN-o).
Is connectedness via any path not something that can be indexed?

As a first point, by not using labels and an indexed property for your starting node, this will already need to first find ALL the nodes in the graph and opening the PropertyContainer to see if the node has the property name with a value "xxx".
Secondly, if you now an approximate maximum depth of parentship, you may want to limit the depth of the search
I would suggest you add a label of your choice to your nodes and index the name property.

Use label, e.g. :Group for your starting point and an index for :Group(name)
Then Neo4j can quickly find your starting point without scanning the whole graph.
You can easily see where the time is spent by prefixing your query with PROFILE.
Do you really want all arbitrarily long paths from the starting point? Or just all pairs of connected nodes?
If the latter then this query would be more efficient.
MATCH (n:Group)-[:IS_PARENT_OF|:PLAYS_ROLE_IN]->(m:Group)
RETURN n,m

Related

Why so many db hit in neo4j?

There is total 1 Category node and 2 Template node in my case. I put an * in [*] to support more further scenarios. But why there are so many db hit in this cypher for current data?
It's probably the * in the relationship part of your query that's doing it.
While you've got only one Category node and two Template nodes, you've asked Neo4j to hop through any number of relationships to get from one to the other and not given it any help to narrow down the search besides specifying the starting node.
For example, if your Category was connected to 100,000 other nodes (of any label, not just Template) you've forced Neo4j to jump through every single one of them looking to see if there's a path to a Template node - and if those nodes have their own connections then they all need to be explored too, because the depth of the traversal isn't constrained.
If you know how Category and Template nodes can be connected in ways you're interested in (for example, if there's only every some specific set of relationships you want to traverse) then you'll radically improve the performance of the query. Equally, reducing the maximum length of the path will help.

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graphing databases and can see how it applies to these problems. So, over the past couple of days I set up neo4j and parsed our data into it: example
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
RETURN LAST(NODES(path)) as d
WHERE d:Order

Is a DFS Cypher Query possible?

My database contains about 300k nodes and 350k relationships.
My current query is:
start n=node(3) match p=(n)-[r:move*1..2]->(m) where all(r2 in relationships(p) where r2.GameID = STR(id(n))) return m;
The nodes touched in this query are all of the same kind, they are different positions in a game. Each of the relationships contains a property "GameID", which is used to identify the right relationship if you want to pass the graph via a path. So if you start traversing the graph at a node and follow the relationship with the right GameID, there won't be another path starting at the first node with a relationship that fits the GameID.
There are nodes that have hundreds of in and outgoing relationships, some others only have a few.
The problem is, that I don't know how to tell Cypher how to do this. The above query works for a depth of 1 or 2, but it should look like [r:move*] to return the whole path, which is about 20-200 hops.
But if i raise the values, the querys won't finish. I think that Cypher looks at each outgoing relationship at every single path depth relating to the start node, but as I already explained, there is only one right path. So it should do some kind of a DFS search instead of a BFS search. Is there a way to do so?
I would consider configuring a relationship index for the GameID property. See http://docs.neo4j.org/chunked/milestone/auto-indexing.html#auto-indexing-config.
Once you have done that, you can try a query like the following (I have not tested this):
START n=node(3), r=relationship:rels(GameID = 3)
MATCH (n)-[r*1..]->(m)
RETURN m;
Such a query would limit the relationships considered by the MATCH cause to just the ones with the GameID you care about. And getting that initial collection of relationships would be fast, because of the indexing.
As an aside: since neo4j reuses its internally-generated IDs (for nodes that are deleted), storing those IDs as GameIDs will make your data unreliable (unless you never delete any such nodes). You may want to generate and use you own unique IDs, and store them in your nodes and use them for your GameIDs; and, if you do this, then you should also create a uniqueness constraint for your own IDs -- this will, as a nice side effect, automatically create an index for your IDs.

Neo4j: Java API to compute intersection multiple properties

I'm very new in using Neo4j and have a question regarding the computation of intersections of nodes.
Let's suppose, I have the three properties A,B,C and I want to select only the nodes that have all three properties.
I created an index for the properties and thus, I can get all nodes having one of the properties. However, afterwards I have to merge the IndexHits. Is there a way to select directly all nodes having the three properties?
My second idea was to create a node for each property and connect other nodes by relationships. I can then iterate over all relationships and get for each property a list of nodes which are connected. But again, I have to compute the intersection afterwards.
Is there a function I miss here, since I suppose it's a standard problem.
Thanks a lot,
Benny
Do you also have the values you look for? You would start with the property that limits the amount of found nodes most.
MATCH (a:Label {property1:{value1}})
WHERE a.property2 = {value2} AND a.property3 = {value3}
RETURN a
For the Java API and lucene indexes:
gdb.index().forNodes("foo").query("p1:value1 p2:value2 p3:value3")
Lucene query syntax

Find all sub-graphes containing at least one node having a certain property

My graph is composed of multiple "sub-graphes" that are disconnected from one another. These sub-graphes are composed of nodes that are connected with a given relation type.
I would like to get (for example) the list of sub-graphes that contain at least one node that has the property "name" equals "John".
It's equivalent to finding one node per subgraph having this property.
One solution would be to find all the nodes having this property and loop through this list to only pick the ones that are not connected to the previously picked ones. But that would be ugly and quite heavy. Is there an elegant way to do that with Cypher?
I'm trying with something along this direction but have no success so far:
START source=node:user('name:"John"')
MATCH source-[r?:KNOWS*]-target
WHERE r is null
RETURN source
Try this one it may help
START source=node:user('name:"John"')
MATCH source-[r:KNOWS]-()-[r2:KNOWS]-target
WHERE NOT(source-[r:KNOWS]-target)
RETURN target

Resources