Why so many db hit in neo4j? - neo4j

There is total 1 Category node and 2 Template node in my case. I put an * in [*] to support more further scenarios. But why there are so many db hit in this cypher for current data?

It's probably the * in the relationship part of your query that's doing it.
While you've got only one Category node and two Template nodes, you've asked Neo4j to hop through any number of relationships to get from one to the other and not given it any help to narrow down the search besides specifying the starting node.
For example, if your Category was connected to 100,000 other nodes (of any label, not just Template) you've forced Neo4j to jump through every single one of them looking to see if there's a path to a Template node - and if those nodes have their own connections then they all need to be explored too, because the depth of the traversal isn't constrained.
If you know how Category and Template nodes can be connected in ways you're interested in (for example, if there's only every some specific set of relationships you want to traverse) then you'll radically improve the performance of the query. Equally, reducing the maximum length of the path will help.

Related

Find all nodes with two-way relationships starting from one specific node using cypher in neo4j

neo4j nodes and relationships
This is quite a tough job. I'm trying to find all nodes with two-way relationships starting from a specific node. Based on the image above, I would like to find all two-way relationships starting from node 1. Only nodes with two-way relationships match. For example, node 1,3,4 matches and node 1,2,3 matches as two separate groups. However, if node 2 and 4 has a two-way relationship, then node 1,2,3,4 matches as one group. The main idea is that all nodes are linked both ways in such a group. My idea is to find all nodes with two-way relationships starting from 1 and continue processing, but I'm not able to continue. Can anyone help me with this problem, thanks a lot. By the way, only the largest 'two-way-circle' is needed.
Your problem looks a lot like finding strongly connected components in the graph. As defined in the docs.
A directed graph is strongly connected if there is a path between all
pairs of vertices ( nodes ). This algorithms treats the graph as directed, so
the direction of the relationship is important and strongly connected
compoment exists only if there are relationships between nodes in both
direction.
Check out more in the documentation. You will need neo4j-graph-algorithms.
Example query with writing back the component of the graph to the node.
CALL algo.scc('Label','C', {write:true,partitionProperty:'partition'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize
And then you can find your biggest component with the following query.
MATCH (u:Label)
RETURN distinct(u.partition) as partition,count(*) as size_of_partition
ORDER by size_of_partition DESC LIMIT 1

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graphing databases and can see how it applies to these problems. So, over the past couple of days I set up neo4j and parsed our data into it: example
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
RETURN LAST(NODES(path)) as d
WHERE d:Order

Fast search for unconnected nodes in big neo4j graph

So, i've created a Neo4j graph database out of a relational database. The graph database has about 7 million nodes, and about 9 million relationships between the nodes.
I now want to find all nodes, that are not connected to nodes with a certain label (let's call them unconnected nodes). For example, i have nodes with the labels "Customer" and "Order" (let's call them top-level-nodes). I want to find all nodes that have no relationship from or to these top-level-nodes. The relationship doesn't have to be direct, the nodes can be connected via other nodes to the top-level-nodes.
I have a cypher query which would solve this problem:
MATCH (a) WHERE not ((a)-[*]-(:Customer)) AND not ((a)-[*]-(:Order)) RETURN a;
As you can imagine, the query will need a long time to execute, the performance is bad. Most likely because of the undirected relationship and because it doesn't matter via how many nodes the relationship can be made. However, the relationship directions don't matter, and i need to make sure that there is no path from any node to one of the top-level-nodes.
Is there any way to find the unconnected nodes faster ? Note that the database is really big, and there are more than 2 labels which mark top-level-nodes.
You could try this approach, which does involve more operations, but can be run in batches for better performance (see apoc.periodic.commit() in the APOC procedures library).
The idea is to first apply a label (say, :Unconnected) to all nodes in your graph (batch execute with apoc.periodic.commit), and then, taking batches of top level nodes with that label, matching to all nodes in the subgraphs extending from them and removing that label.
When you finally have run out of top level nodes with the :Unconnected label (meaning all top level nodes and their subgraphs no longer have this label) then the only nodes remaining in your graph with the :Unconnected label are not connected to your top level nodes.
Any approach to this kind of operation will likely be slow, but the advantage again is that you can process this in batches, and if you get interrupted, you can resume. Once your queries are done, all the relevant unconnected nodes are now labeled for further processing at your convenience.
Also, one last note, in Neo4j undirected relationships have no arrows in the syntax ()-[*]-().
MATCH (a)
WHERE
not (a:Customer OR a:Order)
AND shortestPath((a)-[*]-(:Customer)) IS NULL
AND shortestPath((a)-[*]-(:Order)) IS NULL
RETURN a;
If you could add rel-types it would be faster.
One further optimization could be to check the nodes of an :Customer path for an :Order node and vice versa. i.e.
NONE(n in nodes(path) WHERE n:Order)
In general, this might be rather a set operation, i.e.
expand around all order and customer nodes in parallel into two sets
and compute the overlap between the two sets.
Then remove the overlap from the total number of nodes.
I added an issue for apoc here to add such a function or procedure
https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/223

Most efficient way to get all connected nodes in neo4j

The answer to this question shows how to get a list of all nodes connected to a particular node via a path of known relationship types.
As a follow up to that question, I'm trying to determine if traversing the graph like this is the most efficient way to get all nodes connected to a particular node via any path.
My scenario: I have a tree of groups (group can have any number of children). This I model with IS_PARENT_OF relationships. Groups can also relate to any other groups via a special relationship called role playing. This I model with PLAYS_ROLE_IN relationships.
The most common question I want to ask is MATCH(n {name: "xxx") -[*]-> (o) RETURN o.name, but this seems to be extremely slow on even a small number of nodes (4000 nodes - takes 5s to return an answer). Note that the graph may contain cycles (n-IS_PARENT_OF->o, n<-PLAYS_ROLE_IN-o).
Is connectedness via any path not something that can be indexed?
As a first point, by not using labels and an indexed property for your starting node, this will already need to first find ALL the nodes in the graph and opening the PropertyContainer to see if the node has the property name with a value "xxx".
Secondly, if you now an approximate maximum depth of parentship, you may want to limit the depth of the search
I would suggest you add a label of your choice to your nodes and index the name property.
Use label, e.g. :Group for your starting point and an index for :Group(name)
Then Neo4j can quickly find your starting point without scanning the whole graph.
You can easily see where the time is spent by prefixing your query with PROFILE.
Do you really want all arbitrarily long paths from the starting point? Or just all pairs of connected nodes?
If the latter then this query would be more efficient.
MATCH (n:Group)-[:IS_PARENT_OF|:PLAYS_ROLE_IN]->(m:Group)
RETURN n,m

An Example Showing the Necessity of Relationship Type Index and Related Execution Plan Optimization

Suppose I have a large knowledge base with many relationship types, e.g., hasChild, livesIn, locatedIn, capitalOf, largestCityOf...
The number of capicalOf relationships is relatively small (say, one hundred) compared to that of all nodes and other types of relationships.
I want to fetch any capital which is also the largest city in their country by the following query:
MATCH city-[:capitalOf]->country, city-[:largestCityOf]->country RETURN city
Apparently it would be wise to take the capitalOf type as clue, scan all 100 relationship with this type and refine by [:largestCityOf]. However the current execution plan engine of neo4j would do an AllNodesScan and Expand. Why not consider add an "RelationshipByTypeScan" operator into the current query optimization engine, like what NodeByLabelScan does?
I know that I can transform relationship types to relationship properties, index it using the legacy index and manually indicate
START r=relationship:rels(rtype = "capitalOf")
to tell neo4j how to make it efficient. But for a more complicated pattern query with many relationship types but no node id/label/property to start from, it is clearly a duty of the optimization engine to decide which relationship type to start with.
I saw many questions asking the same problem but getting answers like "negative... a query TYPICALLY starts from nodes... ". I just want to use the above typical scenario to ask why once more.
Thanks!
A relationship is local to its start and end node - there is no global relationship dictionary. An operation like "give me globally all relationships of type x" is therefore an expensive operation - you need to go through all nodes and collect matching relationships.
There are 2 ways to deal with this:
1) use a manual index on relationships as you've sketched
2) assign labels to your nodes. Assume all the country nodes have a Country label. Your can rewrite your query:
MATCH (city)-[:capitalOf]->(country:Country), (city)-[:largestCityOf]->(country) RETURN city
The AllNodesScan is now a NodeByLabelScan. The query grabs all countries and matches to the cities. Since every country does have one capital and one largest city this is efficient and scales independently of the rest of your graph.
If you put all relationships into one index and try to grab to ~100 capitalOf relationships that operation scales logarithmically with the total number of relationships in your graph.

Resources