Traverse the graph to get possible topology - neo4j

I am writing my master's thesis using the Neo4j database and have run into a problem.
The picture on the left shows the data I saved in Neo4j. The whole picture represents how an application could be deployed in the cloud, and every node represents a service.
For example, I have an Apache Module that can be "hosted_on" an Apache Server. The green line represents an alternative option, because an Apache Server can be hosted on either a Windows system or a Linux system.
So there are two possibilities for deployment, shown on the right.
The right side is what I want. I call it a topology: it defines what an application deployment looks like.
What I want is to retrieve all possible topologies.
How can I get these possible topologies with Cypher or the Java traversal API?
Thanks very much.

I'm not sure if this is what you are getting at, but it might be helpful to consider the "What is related and how?" query:
// What is related, and how
MATCH (a)-[r]->(b)
WHERE labels(a) <> [] AND labels(b) <> []
RETURN DISTINCT head(labels(a)) AS This, type(r) as To, head(labels(b)) AS That
LIMIT 10
This will return Node labels and relationship names that are connected by at least one relationship in the graph. Is that what you mean by topology?
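If "topology" instead means each complete chain of hosting choices, the enumeration can be sketched outside the database. The following is a minimal Python sketch, not a Neo4j API: the dictionary HOSTED_ON and the function topologies are hypothetical names, the data is a hand-copied guess at the example from the question, and the graph is assumed to be acyclic. In Cypher, the analogous idea would presumably be a variable-length match over hosted_on relationships that ends at nodes with no further host.

```python
from typing import Dict, List

# Hypothetical in-memory copy of the "hosted_on" alternatives from the question.
HOSTED_ON: Dict[str, List[str]] = {
    "Apache Module": ["Apache Server"],
    "Apache Server": ["Windows", "Linux"],
}


def topologies(node: str, hosted_on: Dict[str, List[str]]) -> List[List[str]]:
    """Return every chain from `node` down to a service that has no further host.

    Assumes the hosted_on graph is acyclic; a cycle would recurse forever.
    """
    hosts = hosted_on.get(node, [])
    if not hosts:
        # Nothing hosts this node any further, so the chain ends here.
        return [[node]]
    chains = []
    for host in hosts:
        for tail in topologies(host, hosted_on):
            chains.append([node] + tail)
    return chains


for chain in topologies("Apache Module", HOSTED_ON):
    print(" -> ".join(chain))
```

With the data above this yields two chains, one ending in Windows and one ending in Linux, matching the two deployments described in the question.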

Related

Cypher - unlimited path length and large path length queries hang

I am using Neo4j Community 4.0.4.
I have encountered this issue using the official Bolt driver for Python, but it is also completely reproducible in the Neo4j browser (version 4.0.7).
I have a very simple graph for now, consisting of the following node and relationship types:
(:Document)-[:contains]->(:Block)
(:Block)<-[:prev]-(:Block)-[:next]->(:Block)
There are only 75 nodes in my entire test database for now - 1 Document node and 74 Block nodes.
Running the following Cypher statement brings the CPU to 100% and the memory utilization rises indefinitely, after which I have to kill the session:
match (d:Doc{name: 'doc name'})
optional match (d)-[*]-(n)
return d,n
I also got the Java heap size error at some point.
It only starts to work if I set a strict upper bound on the relationship or specify the direction, e.g.:
optional match (d)-[*..5]->(n)
For example, this already does not work (the answer takes forever so I have to kill the session):
optional match (d)-[*..5]-(n)
Considering that (a) I am doing a strictly local graph traversal that graph databases are supposed to be exceptionally good at, (b) clusters associated with different starting nodes are NOT connected and (c) my test data set is tiny, how can this be happening?
From the symptoms it appears that the engine simply does not keep track of which nodes and relationships were already visited when preparing the results ... or am I missing something?
UPDATE:
This was just answered via the Neo4j community forum by a Neo4j staff member:
https://community.neo4j.com/t/getting-paths-of-any-length-or-long-paths-does-not-work/18298
I wrongly assumed that Cypher would dynamically switch from path-uniqueness traversal to node-uniqueness traversal just because the operation following the match dealt only with nodes and not with relationships.
Poor assumption on my part: not only does Cypher not do this automatically, there is no way AT ALL in core Cypher to drop a path during traversal if all the nodes in the path were already visited.
The APOC-based solution was suggested:
match (d:Doc{name: 'doc name'})
CALL apoc.path.subgraphNodes(d, {}) YIELD node as n
return d, n
In my case I have disconnected sub-graphs that are tens of thousands of nodes each and are relatively dense. This came up when trying to delete a (:Doc) node and everything that's connected to it before re-loading a new version of the sub-graph into Neo4j:
detach delete d, n
I see this task of "removing the old version before re-loading" as a very common operational task for sub-graphs that many people may have in their use cases... Installing and managing additional libraries (like APOC or the Graph Data Science library) seems like an overkill for something this simple... But it's either that or making the deletions more targeted.
A MATCH clause avoids traversing the same relationship twice, so that would not be the issue. However, it can still travel between the same 2 nodes multiple times (as long as different relationships are used).
The main thing to consider is that variable-length relationship patterns have exponential (time and memory) complexity. If the nodes being traversed have an average of R relevant relationships, then the MATCH clause has to traverse about R**P possible paths of length P. The higher that P gets (especially with no upper bound), the worse it gets. But a high R also hurts.
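The gap between the two traversal styles can be illustrated without Neo4j. The sketch below is plain Python, not Cypher or APOC, and the function names are made up for illustration. It counts paths that never reuse a relationship (the uniqueness rule a MATCH applies) and compares that with a breadth-first search that visits each node once, which is the idea behind node-global uniqueness. Small complete graphs are used as an assumed worst case for density.

```python
from collections import deque
from itertools import combinations


def complete_graph(n):
    """Undirected complete graph: every pair of nodes shares one relationship."""
    adj = {i: [] for i in range(n)}
    for rel_id, (a, b) in enumerate(combinations(range(n), 2)):
        adj[a].append((rel_id, b))
        adj[b].append((rel_id, a))
    return adj


def count_trails(adj, start):
    """Count paths from `start` that never reuse a relationship (nodes may repeat)."""
    def walk(node, used):
        total = 0
        for rel_id, nxt in adj[node]:
            if rel_id not in used:
                # One path ends at `nxt`, plus every extension of that path.
                total += 1 + walk(nxt, used | {rel_id})
        return total
    return walk(start, frozenset())


def reachable_nodes(adj, start):
    """Breadth-first search that expands each node only once."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _, nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


for n in (3, 4, 5):
    adj = complete_graph(n)
    print(n, "nodes:", count_trails(adj, 0), "paths vs",
          len(reachable_nodes(adj, 0)), "reachable nodes")
```

Even on a triangle there are 6 relationship-unique paths from one node but only 2 other nodes to reach, and the path count grows much faster than the node count as the graph gets larger and denser.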

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned about graph databases and can see how they apply to these problems. So, over the past couple of days I set up Neo4j and parsed our data into it: example
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
// WHERE cannot follow RETURN, so filter in a WITH clause first
WITH last(nodes(path)) AS d
WHERE d:Order
RETURN d

Neo4j Cypher find all paths exploring sorted relationships

I've been struggling for days to find all paths (up to a maximum length) between two nodes while controlling Neo4j's path exploration by sorting the relationships that are going to be explored (by one of their properties).
So to be clear, let's say I want to find the K best paths between two nodes up to a maximum length M. The query will be like:
match (source{name:"source"}), (target{name:"target"}),
p = (source)-[*..M]->(target)
return p order by length(p) limit K;
So far so good. But let's say the relationships of the path have a property called "priority". What I want is to write a query that tells Neo4j, at each step of path exploration, which relationships should be explored first.
I know this is possible when I use the Java libraries and an embedded database (by implementing the PathExpander interface and passing it to the GraphAlgoFactory.allSimplePaths() function in Java).
But now I'm trying to find a way to do this when accessing the database in server mode, using Bolt or the REST API.
Is there any way to do this in server mode? Or maybe by using the Java library functions while accessing the graph in server mode?
use labels and an index to find your two start-nodes
perhaps consider allShortestPaths to make it faster
try this:
match (source{name:"source"}), (target{name:"target"}),
p = (source)-[rels*..20]->(target)
return p, reduce(prio=0, r IN rels | prio + r.priority) as priority
order by priority ASC, length(p)
limit 100;
I had a very similar problem. I was trying to find the shortest path from one node to all other nodes. I had written a query similar to the one in the answer above (https://stackoverflow.com/a/38030536/783836) and couldn't get it to perform in any reasonable time.
Asking "Can Graph DBs perform well with unspecified end nodes?" pointed me to the solution: the Single Source Shortest Path algorithm.
In Neo4j you need to install the Graph Data Science Library and make use of this function: gds.alpha.shortestPath.deltaStepping.stream
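To show what a single-source shortest path computation produces, here is a minimal Python sketch using Dijkstra's algorithm. It is not the delta-stepping implementation from the Graph Data Science Library and does not call Neo4j at all. The adjacency dictionary and its weights are invented for illustration, and weights are assumed to be non-negative.

```python
import heapq


def single_source_shortest_paths(graph, source):
    """Dijkstra's algorithm: cheapest total weight from `source` to every reachable node.

    `graph` maps a node to a list of (neighbor, weight) pairs with weights >= 0.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a cheaper route was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical weighted graph; the node names only echo the question's query.
graph = {
    "source": [("a", 1.0), ("b", 4.0)],
    "a": [("b", 2.0), ("target", 6.0)],
    "b": [("target", 1.0)],
}

print(single_source_shortest_paths(graph, "source"))
```

One run returns the distance to every reachable node, which is why this style of algorithm suits the "one node to all other nodes" case better than enumerating paths per target.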

Neo4j - is it possible to visualise a simple overview of my database?

I've got my graph database, populated with nodes, relationships, properties etc. I'd like to see an overview of how the whole database is connected, each relationship to each node, properties of a node etc.
I don't mean view each individual node, but rather something like an ERD from a relational database, something like this, with the node labels. Is this possible?
You can use the metadata by running the command call db.schema().
In Neo4j v4 call db.schema() is deprecated, you can now use call db.schema.visualization()
As far as I know, there is no straight-forward way to get a nicely pictured diagram of a neo4j database structure.
There is a pre-defined query in the neo4j browser which finds all node types and their relationships. However, it traverses the complete graph and may fail due to memory errors if you have too much data.
Also, there is neoprofiler. It's a tool which claims to do what you ask. I never tried it and it hasn't had many updates lately. Still worth a try: https://github.com/moxious/neoprofiler
Even though this is not a graphical representation, this query will give you an idea on what type of nodes are connected to other nodes with what type of relationship.
MATCH (n)
OPTIONAL MATCH (n)-[r]->(x)
WITH DISTINCT {l1: labels(n), r: type(r), l2: labels(x)}
AS `first degree connection`
RETURN `first degree connection`;
You could use this query to then unwind the labels to write that next cypher query dynamically (via a scripting language and using the REST API) and then paste that query back into the neo4j browser to get an example set of the data.
But this should be good enough to get an overview of your graph. Expand from here.

Most efficient way to get all connected nodes in neo4j

The answer to this question shows how to get a list of all nodes connected to a particular node via a path of known relationship types.
As a follow up to that question, I'm trying to determine if traversing the graph like this is the most efficient way to get all nodes connected to a particular node via any path.
My scenario: I have a tree of groups (group can have any number of children). This I model with IS_PARENT_OF relationships. Groups can also relate to any other groups via a special relationship called role playing. This I model with PLAYS_ROLE_IN relationships.
The most common question I want to ask is MATCH (n {name: "xxx"})-[*]->(o) RETURN o.name, but this seems to be extremely slow on even a small number of nodes (4000 nodes - takes 5s to return an answer). Note that the graph may contain cycles (n-IS_PARENT_OF->o, n<-PLAYS_ROLE_IN-o).
Is connectedness via any path not something that can be indexed?
As a first point, because you are not using a label and an indexed property for your starting node, the query first has to scan ALL the nodes in the graph and open each PropertyContainer to see whether the node has the property name with the value "xxx".
Secondly, if you know an approximate maximum depth of parentship, you may want to limit the depth of the search.
I would suggest you add a label of your choice to your nodes and index the name property.
Use label, e.g. :Group for your starting point and an index for :Group(name)
Then Neo4j can quickly find your starting point without scanning the whole graph.
You can easily see where the time is spent by prefixing your query with PROFILE.
Do you really want all arbitrarily long paths from the starting point? Or just all pairs of connected nodes?
If the latter then this query would be more efficient.
MATCH (n:Group)-[:IS_PARENT_OF|PLAYS_ROLE_IN]->(m:Group)
RETURN n,m

Resources