Is a DFS Cypher Query possible? - neo4j

My database contains about 300k nodes and 350k relationships.
My current query is:
start n=node(3) match p=(n)-[r:move*1..2]->(m) where all(r2 in relationships(p) where r2.GameID = STR(id(n))) return m;
The nodes touched in this query are all of the same kind, they are different positions in a game. Each of the relationships contains a property "GameID", which is used to identify the right relationship if you want to pass the graph via a path. So if you start traversing the graph at a node and follow the relationship with the right GameID, there won't be another path starting at the first node with a relationship that fits the GameID.
There are nodes that have hundreds of in and outgoing relationships, some others only have a few.
The problem is, that I don't know how to tell Cypher how to do this. The above query works for a depth of 1 or 2, but it should look like [r:move*] to return the whole path, which is about 20-200 hops.
But if i raise the values, the querys won't finish. I think that Cypher looks at each outgoing relationship at every single path depth relating to the start node, but as I already explained, there is only one right path. So it should do some kind of a DFS search instead of a BFS search. Is there a way to do so?

I would consider configuring a relationship index for the GameID property. See http://docs.neo4j.org/chunked/milestone/auto-indexing.html#auto-indexing-config.
Once you have done that, you can try a query like the following (I have not tested this):
START n=node(3), r=relationship:rels(GameID = 3)
MATCH (n)-[r*1..]->(m)
RETURN m;
Such a query would limit the relationships considered by the MATCH cause to just the ones with the GameID you care about. And getting that initial collection of relationships would be fast, because of the indexing.
As an aside: since neo4j reuses its internally-generated IDs (for nodes that are deleted), storing those IDs as GameIDs will make your data unreliable (unless you never delete any such nodes). You may want to generate and use you own unique IDs, and store them in your nodes and use them for your GameIDs; and, if you do this, then you should also create a uniqueness constraint for your own IDs -- this will, as a nice side effect, automatically create an index for your IDs.

Related

NEO4J - Matching a path where middle node might exist or not

I have the following graph:
I would look to get all contractors and subcontractors and clients, starting from David.
So I thought of a query likes this:
MATCH (a:contractor)-[*0..1]->(b)-[w:works_for]->(c:client) return a,b,c
This would return:
(0:contractor {name:"David"}) (0:contractor {name:"David"}) (56:client {name:"Sarah"})
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Which returns the desired result. The issue here is performance.
If the DB contains millions of records and I leave (b) without a label, the query will take forever. If I add a label to (b) such as (b:subcontractor) I won't hit millions of rows but I will only get results with subcontractors:
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Is there a more efficient way to do this?
link to graph example: https://console.neo4j.org/r/pry01l
There are some things to consider with your query.
The relationship type is not specified- is it the case that the only relationships from contractor nodes are works_for and hired? If not, you should constrain the relationship types being matched in your query. For example
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b)-[w:works_for]->(c:client)
RETURN a,b,c
The fact that (b) is unlabelled does not mean that every node in the graph will be matched. It will be reached either as a result of traversing the works_for or hired relationships if specified, or any relationship from :contractor, or via the works_for relationship.
If you do want to label it, and you have a hierarchy of types, you can assign multiple labels to nodes and just use the most general one in your query. For example, you could have a label such as ExternalStaff as the generic label, and then further add Contractor or SubContractor to distinguish individual nodes. Then you can do something like
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b:ExternalStaff)-[w:works_for]->(c:client)
RETURN a,b,c
Depends really on your use cases.

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graphing databases and can see how it applies to these problems. So, over the past couple of days I set up neo4j and parsed our data into it: example
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
RETURN LAST(NODES(path)) as d
WHERE d:Order

How to cluster nodes together in Neo4j

My graph is 1M nodes. The data model is intentionally simple. There are Entities and IDType nodes. A single Entity may have 1:many IDType nodes. And an IDType node may be connected to 1:many Entities. This forms the graph.
The goal is to find all clusters of IDType's and Entities that are connected together into what I call a cluster of nodes (subgraph I guess some call it). Imagine if we had 1M nodes. I would like to find "clusters" like this in the graph data, I'm trying to figure out how to do that. I've written the cypher query that I believe does it, but it's not clear to me if it's doing what is intended.
The question: how do I efficiently traverse my graph and cluster together nodes so that there is a single row or group of rows that I can return as a row-based result set to my python driver program to then operate over that cluster. While this doesn't need to be the exact structure of my result, this is a sense of what I'm looking for.
cluster|nodes
1|2,3,4,5,6,7
2|10,11,12,13
3|15,17,19,20,21,25,27,28,33
Where the "cluster" is some arbitrary clustering of the list of nodes (frankly if I have a single line that's just a collection of clusters or some other way of telling they are all related, then I'm golden). The "nodes" number represents a unique integer-based property that we tag to every Entity node.
The query is below. The concept is that an "Entity" node can have 1 or many "ID" nodes and I'm trying to get all "Entity" and "ID" that are related to each other through the relationship "HAS_ID".
Conceptually, if there is a relationship that exists in the data like this Entity1-->ID1<--Entity2-->ID2<--Entity3-->ID3<--Entity4-->ID4<--Entity5 then I want to "cluster" them together so that I can create a unique number that represents this group of nodes. With my example, there are 5 entities, but there could just as easily be 2 entities, or 50 entities, which are all related to one another, that's why I'm thinking the variable length path is what I need.
The below is my attempt to do this in the graph. But 1) is it correct? 2) is it efficient because it seems to runs indefinitely 3) how do i best "group" these together?
match
(n:Entity)-[e1:HAS_ID*]-(o)
where n.key <> o.key
return *
limit 10
;
I've also tried
match (n:Entity)-[e1:HAS_ID*]-(o)
where n.key <> o.key
with distinct n.key as key_1, o.key as key_2
return key_1, collect(key_2)
limit 100
;
This seems to do close to what I want, but I'm still not getting a single group for a given key, in other words, I can have 5 rows returned but they are all still related, which I'd rather have 1 row in that case... He's an example, you can see that key "49518" is on the first and second row, I'd rather have one row that grouped them all together.
49518 [49004, 49871, 49940, 50525, 49101, 49625, 50165, 50017, 49098, 50383]
49940 [49088, 49706, 50292, 50470, 49140, 49258, 49216, 49559, 50004, 50346, 49237, 49518, 49894, 49101, 49625, 50165, 50017, 49098, 50383]
Well, for one, your query doesn't match the relationship pattern you described.
Each of your arrows in your pattern is a [:HAS_ID] relationship, so if entities and IDs are always alternating between each relationship, then your current query would only match patterns like this:
(:Entity)-[:HAS_ID]->(:ID)<-[:HAS_ID]-(:Entity)-[:HAS_ID]->(:ID)<-[:HAS_ID]-(:Entity)
3 entities, 2 IDs, 4 relationships. That doesn't match your example pattern of 5 entities, 4 IDs, and 8 relationships. So at the very least, you'll want to alter your pattern to use *8.
As for efficiency...the thing you're trying to do seems rather inefficient, as it must attempt to find this pattern on every single :Entity node in your graph, trying every single :HAS_ID relationship it finds. If your entire graph is made of this same pattern of :Entity and :ID and :HAS_ID, then your query is going to be traversing your entire graph, not once but multiple times.
You are going to get duplicate results. Even if we assume that your entire graph is made up of isolated 5 entity / 4 ID / 8 relationship chains like a snake, as in your example (an entity either being at the end of the chain with one link to an ID, or somewhere in the middle with links to 2 IDs), then you'll be getting 2 matches for that same group of nodes, one matching from one end of the chain, the other matching the other end. And that's the simple case...I'm guessing your graph could be much more complex than this, allowing even more possibilities for many different patterns to match on the exact same group of nodes. A unique path using your pattern does not equate to a unique grouping of nodes.
At the very least, you'll probably want to match on a pattern and use RETURN DISTINCT NODES(p) to enforce unique sets of nodes, but I still think the matching may take quite a bit of time.

Most efficient way to get all connected nodes in neo4j

The answer to this question shows how to get a list of all nodes connected to a particular node via a path of known relationship types.
As a follow up to that question, I'm trying to determine if traversing the graph like this is the most efficient way to get all nodes connected to a particular node via any path.
My scenario: I have a tree of groups (group can have any number of children). This I model with IS_PARENT_OF relationships. Groups can also relate to any other groups via a special relationship called role playing. This I model with PLAYS_ROLE_IN relationships.
The most common question I want to ask is MATCH(n {name: "xxx") -[*]-> (o) RETURN o.name, but this seems to be extremely slow on even a small number of nodes (4000 nodes - takes 5s to return an answer). Note that the graph may contain cycles (n-IS_PARENT_OF->o, n<-PLAYS_ROLE_IN-o).
Is connectedness via any path not something that can be indexed?
As a first point, by not using labels and an indexed property for your starting node, this will already need to first find ALL the nodes in the graph and opening the PropertyContainer to see if the node has the property name with a value "xxx".
Secondly, if you now an approximate maximum depth of parentship, you may want to limit the depth of the search
I would suggest you add a label of your choice to your nodes and index the name property.
Use label, e.g. :Group for your starting point and an index for :Group(name)
Then Neo4j can quickly find your starting point without scanning the whole graph.
You can easily see where the time is spent by prefixing your query with PROFILE.
Do you really want all arbitrarily long paths from the starting point? Or just all pairs of connected nodes?
If the latter then this query would be more efficient.
MATCH (n:Group)-[:IS_PARENT_OF|:PLAYS_ROLE_IN]->(m:Group)
RETURN n,m

How to select relationships spreading from neo4j?

We have a scenario to display relationships spreading pictures(or messages) to user.
For example: Relationship 1 of Node A has a message "Foo", Relationship 2 of Node 2 also has same message "Foo" ... Relationship n of Node n also has same message "Foo".
Now we are going to display a relationship graph by query Neo4j.
This is my query:
MATCH (a)-[r1]-()-[r2]-()-[r3]-()-[r4]
WHERE a.id = '59072662'
and r2.message_id = r1.target_message_id
and r3.message_id = r2.target_message_id
and r4.message_id = r3.target_message_id
RETURN r1,r2,r3,r4
The problem is, this query does not work if there are only 2 levels of linking. If there is only a r1 and r2, this query returns nothing.
Please tell me how to write a Cypher query returns a set of relationships of my case?
Adding to Stefan's answer.
If you want to keep track of how pictures spread then you would also include a relationship to the image like:
(message)-[:INCLUDES]->(image)
If you want how a specific picture got spread in the message network:
MATCH (i:Image {url: "X"}), p=(recipient:User)<-[*]-(m:Message)<-[*]-(sender:User)
WHERE (m)-[:INCLUDES]->(i) WITH length(p) as length, sender ORDER BY length
RETURN DISTINCT sender
This will return all senders, ordered by path length, so the top one should be the original sender.
If you're just interested in the original sender you could use LIMIT 1.
Alternatively, if you find yourself traversing huge networks and hitting performance issue because of the massive paths that have to be traversed, you could also add a relationship between the message and the original uploader.
The answer to the question you psoted at the bottom, about the way to get a set of relationships in a variable length path:
You define a path, like in the example above
p=(recipient:User)<-[*]-(m:Message)<-[*]-(sender:User)
Then, to access the relationships in that path, you use the rels function
RETURN rels(p)
You didn't provide much details on your use case. From my experience I suggest that you rethink your way of graph data modelling.
A message seems to be a central concept in your domain. Therefore the message should be probably modeled as a node. To connect (a) and (b) via message (m), you might use something like (a)-[:SENT]->(m {message_id: ....})-[TO:]->(b).
Using this (m) could easily have a REFERS_TO relationship to another message making the query above way more graphy.

Resources