Limiting nodes per label

Limiting nodes per label - neo4j

I have a graph that currently has several thousand nodes, each with between two and ten relationships. A single node and its connections look somewhat like this:
The nodes marked with letters are category nodes. All other nodes are content nodes that have an "associated with" relationship to these category nodes, and their colour denotes which label(s) they carry. For simplicity, every node has a single label, and each node is only connected to a single other node:
Blue: Categories
Green: Scientific Publications
Orange: General Articles
Purple: Blog Posts
Now, the simplest thing I'm trying to do is get a certain number of content nodes related to a given node. The following returns all twenty related nodes:
START n = node(1)
MATCH (n)-->(category)<--(m)
RETURN m
However, I would like to limit this to 2 nodes per label per category (and afterwards play with ordering by nodes that have multiple categories overlapping with the starting node).
Currently I'm doing this by taking the results of the above query and looping through them manually, but this feels like redundant work to me.
Is there a way to do this in Neo4j's Cypher query language?

This answer extends @Stefan's original answer to return the result for all the categories, not just one of them.
START p = node(1)
MATCH (p)-->(category)<--(m)
WITH category, labels(m) as label, collect(m)[0..2] as nodes
UNWIND label as lbl
UNWIND nodes AS n
RETURN category, lbl, n
To make the results easier to verify by hand, you can also append this line to sort them. (This sorting should probably not be in your final code unless you really need sorted results and are willing to expend the extra computing time.)
ORDER BY id(category), lbl

Cypher has a labels function that returns a list of all labels for a given node. Assuming you have exactly one label per m node, the following approach could work:
START n = node(1)
MATCH (n)-->(category)<--(m)
WITH labels(m)[0] as label, collect(m)[0..2] as nodes
UNWIND nodes as n
RETURN n
The WITH clause builds a separate collection of all nodes sharing the same label. The subscript operator [0..2] keeps only the first two elements of each collection. UNWIND then converts the collections back into separate rows for the result. From here on you can apply ordering.
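To see what the grouping, slicing and unwinding amount to, here is a minimal in-memory sketch in Python. It is only an illustration of the logic, not how Neo4j executes the query; the rows, the label names and the helper limit_per_group are all made up for the example.

```python
from collections import defaultdict

# Hypothetical rows standing in for what the MATCH produces.
# Each row is (category, label, node_name); all values are invented.
rows = [
    ("A", "Publication", "p1"),
    ("A", "Publication", "p2"),
    ("A", "Publication", "p3"),
    ("A", "BlogPost", "b1"),
    ("B", "Publication", "p4"),
    ("B", "BlogPost", "b2"),
    ("B", "BlogPost", "b3"),
    ("B", "BlogPost", "b4"),
]


def limit_per_group(rows, key, limit=2):
    """Group rows by key(row), keep the first `limit` per group, then flatten."""
    groups = defaultdict(list)          # plays the role of collect(m)
    for row in rows:
        groups[key(row)].append(row)
    result = []
    for members in groups.values():
        result.extend(members[:limit])  # the [0..2] slice, followed by UNWIND
    return result


# Grouping by label only, as in WITH labels(m)[0] AS label, ...
per_label = limit_per_group(rows, key=lambda r: r[1])

# Grouping by category and label, as in WITH category, labels(m) AS label, ...
per_category_label = limit_per_group(rows, key=lambda r: (r[0], r[1]))
```

Grouping by label alone keeps two nodes per label across the whole result, while adding the category to the key keeps up to two per label within each category.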

Related

NEO4J - Matching a path where middle node might exist or not

I have the following graph:
I would like to get all contractors, subcontractors and clients, starting from David.
So I thought of a query like this:
MATCH (a:contractor)-[*0..1]->(b)-[w:works_for]->(c:client) return a,b,c
This would return:
(0:contractor {name:"David"}) (0:contractor {name:"David"}) (56:client {name:"Sarah"})
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
This is the desired result. The issue here is performance.
If the DB contains millions of records and I leave (b) without a label, the query will take forever. If I add a label to (b), such as (b:subcontractor), I won't hit millions of rows, but I will only get results with subcontractors:
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Is there a more efficient way to do this?
link to graph example: https://console.neo4j.org/r/pry01l

There are some things to consider with your query.
The relationship type is not specified. Are works_for and hired the only relationships from contractor nodes? If not, you should constrain the relationship types matched in your query. For example:
MATCH (a:contractor)-[:works_for|hired*0..1]->(b)-[w:works_for]->(c:client)
RETURN a,b,c
Leaving (b) unlabelled does not mean that every node in the graph will be matched. (b) is only reached by traversal from a :contractor node: over the works_for or hired relationships if you specify them, or over any relationship if you don't. It must then also have a works_for relationship to a :client.
If you do want to label it, and you have a hierarchy of types, you can assign multiple labels to nodes and just use the most general one in your query. For example, you could have a label such as ExternalStaff as the generic label, and then further add Contractor or SubContractor to distinguish individual nodes. Then you can do something like
MATCH (a:contractor)-[:works_for|hired*0..1]->(b:ExternalStaff)-[w:works_for]->(c:client)
RETURN a,b,c
It really depends on your use case.
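As a rough illustration of why an unlabelled (b) is still bounded by traversal, the pattern can be modelled in Python over a tiny edge list. This is a sketch only: the edges below (including the hired relationship between David and John) are assumptions based on the example rows, and it ignores Cypher's relationship-uniqueness rule, which does not matter for this data.

```python
# Hypothetical graph mirroring the example: David hired John,
# and both work for Sarah. The edge list is an assumption.
labels = {"David": "contractor", "John": "subcontractor", "Sarah": "client"}
edges = [
    ("David", "hired", "John"),
    ("David", "works_for", "Sarah"),
    ("John", "works_for", "Sarah"),
]


def outgoing(node, rel_types):
    """Targets reachable from `node` over one outgoing edge of the given types."""
    return [dst for src, rel, dst in edges if src == node and rel in rel_types]


def match_rows(first_hop_types=("works_for", "hired")):
    """Emulate (a:contractor)-[*0..1]->(b)-[:works_for]->(c:client)."""
    rows = []
    for a, label in labels.items():
        if label != "contractor":
            continue
        # *0..1 means b is either a itself (zero hops) or one hop away from a.
        candidates = [a] + outgoing(a, first_hop_types)
        for b in candidates:
            for c in outgoing(b, ("works_for",)):
                if labels[c] == "client":
                    rows.append((a, b, c))
    return rows


rows = match_rows()
```

Only nodes within one hop of a contractor are ever considered for b, so the work grows with the number of contractors and their relationships, not with the total number of nodes.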

Cypher Query filter on related Nodes

I am trying to apply a filter to my Neo4j graph DB, which has 733922 nodes and 303378 relationships; the DB size is 913.56 MiB. I want to fetch all nodes that have related nodes with labels from a specific set. It works if I give up to three related labels, but takes indefinitely long beyond that. If I remove the ORDER BY, it works for up to five related labels. Is this the most optimised query, or am I doing something wrong here? I have attached the PROFILE output for one related label.
MATCH p=(node1:label1{value:'a2cb2c487d9da6941f0f9c1692ed1a5a', source: 'a0b116e2-d9f5-40d4-97ff-6ca6f2ec6c9b'})-[r]-> (:a),(:b),(:c),(:d),(:e),(:f),(:g),(:h),(:i),(:j)
RETURN node1,r
ORDER BY node1.created DESC
LIMIT 25

The following
(:b),(:c),(:d),(:e),(:f),(:g),(:h),(:i),(:j)
creates a Cartesian product between all nodes with the labels b, c, d etc., which grows very quickly.
If you want any of the labels a..j, use the following:
MATCH p=(node1:label1{value:'a2cb2c487d9da6941f0f9c1692ed1a5a', source: 'a0b116e2-d9f5-40d4-97ff-6ca6f2ec6c9b'})-[r]-> (related)
WHERE related:a OR related:b OR ....
If you want to pass the labels as a parameter, you could use the following in the WHERE clause:
WHERE any(x IN labels(related) WHERE x IN $labels)
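To get a feel for how fast the Cartesian product grows compared with filtering the related node by label, here is a small Python sketch. The node counts, node names and the wanted list are invented for illustration; wanted stands in for the $labels parameter.

```python
from itertools import product

# Hypothetical nodes per label; real counts would come from the database.
nodes_by_label = {
    "a": ["a1", "a2", "a3"],
    "b": ["b1", "b2", "b3", "b4"],
    "c": ["c1", "c2", "c3", "c4", "c5"],
}

# Disconnected, comma-separated patterns: every combination becomes a row.
combos = list(product(*nodes_by_label.values()))
cartesian_rows = len(combos)  # 3 * 4 * 5 combinations

# Filtering the related node by label instead: at most one row per related node.
related = [
    {"name": "x", "labels": ["a"]},
    {"name": "y", "labels": ["k"]},
    {"name": "z", "labels": ["c", "k"]},
]
wanted = ["a", "b", "c"]  # plays the role of the $labels parameter
filtered = [
    r["name"] for r in related
    if any(label in wanted for label in r["labels"])
]
```

With only three labels of three to five nodes each, the product already has 60 rows; each extra label multiplies that by its node count, which is why the original query stops returning beyond a few labels.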

How to filter results by node label in neo4j cypher?

I have a graph database that maps out connections between buildings and bus stations, where the graph contains other connecting pieces like roads and intersections (among many node types).
What I'm trying to figure out is how to filter a path down to only return specific node types. I have two related questions that I'm currently struggling with.
Question 1: How do I return the labels of nodes along a path?
It seems like a logical first step is to determine what type of nodes occur along the path.
I have tried the following:
MATCH p=(a:Building)-[:CONNECTED_TO*..5]-(b:Bus)
WITH nodes(p) AS nodes
RETURN DISTINCT labels(nodes);
However, I'm getting a type exception error that labels() expects data of type node and not Collection. I'd like to dynamically know what types of nodes are on my paths so that I can eventually filter my paths.
Question 2: How can I return a subset of the nodes in a path that match a label I identified in the first step?
Say I found that between (a:Building) and (b1:Bus) and (b2:Bus) I can expect to find (:Intersection) nodes and (:Street) nodes.
This is a simplified model of my graph:
(a:Building)--(:Street)--(:Street)--(b1:Bus)
\(:Street)--(:Intersection)--(:Street)--(b2:Bus)
I've written a MATCH statement that would look for all possible paths between (:Building) and (:Bus) nodes. What would I need to do next to filter to selectively return the Street nodes?
MATCH p=(a:Building)-[r:CONNECTED_TO*]-(b:Bus)
// Insert logic to only return (:Street) nodes from p
Any guidance on this would be greatly appreciated!

To get the distinct labels along matching paths:
MATCH p=(a:Building)-[:CONNECTED_TO*..5]-(b:Bus)
WITH NODES(p) AS nodes
UNWIND nodes AS n
WITH LABELS(n) AS ls
UNWIND ls AS label
RETURN DISTINCT label;
To return the nodes that have the Street label:
MATCH p=(a:Building)-[r:CONNECTED_TO*]-(b:Bus)
WITH NODES(p) AS nodes
UNWIND nodes AS n
WITH n
WHERE 'Street' IN LABELS(n)
RETURN n;

Cybersam's answers are good, but their output is simply a column of labels, so you lose the path information completely. If there are multiple paths from a :Building to a :Bus, the first query outputs the labels of all nodes on all paths combined. You can't tell how many paths exist, which labels appear on some paths but not others, or which are common to several paths.
Likewise, the second query loses path information, so if there are multiple paths using different streets to get from a :Building to a :Bus, cybersam's query will return all streets involved in all paths. It is possible for it to output all streets in your graph, which doesn't seem very useful.
You need queries that preserve path information.
For 1, to find the distinct labels of the nodes on each path, I would offer this query:
MATCH p=(:Building)-[:CONNECTED_TO*..5]-(:Bus)
WITH NODES(p) AS nodes
WITH REDUCE(myLabels = [], node in nodes | myLabels + labels(node)) as myLabels
RETURN DISTINCT myLabels
For 2, this query preserves path information:
MATCH p=(:Building)-[:CONNECTED_TO*..5]-(:Bus)
WITH NODES(p) AS nodes
WITH [node IN nodes WHERE node:Street] AS pathStreets
RETURN pathStreets
Note that these are both expensive operations, as they perform a Cartesian product of all buildings and all buses, as do the queries in your description. I highly recommend narrowing down the buildings and buses you're matching on, ideally to a few specific buildings at least.
I also encourage limiting how deep you're looking in your pattern. I get the idea that many, if not most, of your nodes in your graph are connected by :CONNECTED_TO relationships, and if we don't cap that to a reasonable amount, your query could be finding every single path through your entire graph, no matter how long or convoluted or nonsensical, and I don't think that's what you want.
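The difference between flattening everything and keeping one result per path can be sketched in Python. The paths below are hypothetical and follow the simplified model from the question; the node names and helper functions are made up for the example.

```python
from functools import reduce

# Hypothetical paths, each a list of (name, labels) nodes.
paths = [
    [("a", ["Building"]), ("s1", ["Street"]), ("s2", ["Street"]), ("b1", ["Bus"])],
    [("a", ["Building"]), ("s3", ["Street"]), ("i1", ["Intersection"]),
     ("s4", ["Street"]), ("b2", ["Bus"])],
]


def labels_on_path(path):
    """Concatenate the labels of each node, like the REDUCE expression."""
    return reduce(lambda acc, node: acc + node[1], path, [])


def streets_on_path(path):
    """Keep only the Street nodes of one path, like the per-path filter."""
    return [name for name, node_labels in path if "Street" in node_labels]


# One entry per path, so the path structure is preserved.
labels_per_path = [labels_on_path(p) for p in paths]
streets_per_path = [streets_on_path(p) for p in paths]
```

Because each path yields its own list, you can still see that only the second path passes through an Intersection and which streets belong to which route.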

Limitation on Selection of nodes using labels

I want to restrict which nodes Cypher selects when looking at similar nodes, without knowing any property values.
Let's say I have a few nodes with the label Buyer, and I don't know anything more than that about the database. I want to see the list of properties of the Buyer nodes, and all Buyer nodes have the same set of properties. So I did this:
My Approach:
MATCH (n:Buyer)
with keys(n) as each_node_keys
UNWIND each_node_keys as all_keys
RETURN DISTINCT(all_keys)
In my approach, the first line of the query, MATCH (n:Buyer), clearly selects all the nodes, iterates over them, collects all the properties and only then filters. This is not a good idea.
To overcome this, I want to LIMIT the nodes being selected.
Instead of selecting all the nodes, how can I restrict the query to select only one node? Since I don't know any property values, I cannot filter on a property. Once I have picked a node, I should not pick any further nodes. How can I do that?

If, as you said, all Buyer nodes have the same property keys, you can just limit the MATCH to one node:
MATCH (n:Buyer)
WITH n LIMIT 1
RETURN keys(n)
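The idea of stopping after the first node can be illustrated with a lazy scan in Python. This is a sketch assuming the scan is consumed lazily, so that nothing beyond the first node is read; the Buyer nodes and their properties are invented.

```python
from itertools import islice

stats = {"visited": 0}  # counts how many nodes the scan actually reads


def scan_buyers(buyers):
    """Lazily yield Buyer nodes, counting each one that is read."""
    for buyer in buyers:
        stats["visited"] += 1
        yield buyer


# Hypothetical Buyer nodes that all share the same property keys.
buyers = [{"name": f"buyer{i}", "city": "X", "age": 30 + i} for i in range(1000)]

first = next(islice(scan_buyers(buyers), 1))  # like WITH n LIMIT 1
keys = sorted(first.keys())                   # like RETURN keys(n)
```

Only one of the thousand nodes is read before the keys are returned, which is the behaviour the LIMIT is meant to achieve.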

Why the node's label affect the query performance significantly in Neo4j?

I'll try to simplify my question. If all nodes in a Neo4j DB have the same label Science, what's the difference between MATCH n WHERE n.ID="UUID-0001" RETURN n and MATCH (n:Science) WHERE n.ID="UUID-0001" RETURN n? Why is the performance not the same?
My Neo4j database contains about 70000 nodes and 100 relations.
The nodes have two types: Paper and Author, and they both have an ID field.
I created each node with the corresponding label, and I also use ID as the index.
However, one of my functions needs to query nodes by ID without considering the label. The query looks like: MATCH n WHERE n.ID="UUID-0001" RETURN n. The query takes about 4000~5000 ms!
But after adding Science to each node and using MATCH (n:Science) WHERE n.ID="UUID-0001" RETURN n, the query time became about 1000~1100 ms. Does anyone know the difference between these two cases?
PS. Count(n:Science) = Count(n:Paper) + Count(n:Author), which means each node has two labels.

Because Neo4j automatically creates an extra index for every label. The Cypher language can be broadly thought of as piping + filtering, so MATCH n WHERE ... will first get every node and then filter on the WHERE part, whereas MATCH (n:Science) WHERE ... will get every node with the label Science (using an index) and then try to match the WHERE. From your query performance we can see that about 1/5th of your nodes were marked Science, so the query runs in a fifth of the time, because it did a fifth as many comparisons.
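The piping and filtering picture described here can be modelled with a toy Python store. This is only a sketch of that description, not of Neo4j's internals, and the node counts are invented so that one fifth of the nodes carry the label.

```python
# Hypothetical store: 10 nodes, 2 of which carry the label being searched.
nodes = [
    {"ID": f"UUID-{i:04d}", "labels": {"Science"} if i < 2 else {"Other"}}
    for i in range(10)
]

# Lookup table built once, standing in for a per-label index.
by_label = {}
for node in nodes:
    for label in node["labels"]:
        by_label.setdefault(label, []).append(node)


def find_all_nodes(target):
    """Like MATCH n WHERE n.ID = target: test every node in the store."""
    comparisons = 0
    hits = []
    for node in nodes:
        comparisons += 1
        if node["ID"] == target:
            hits.append(node)
    return hits, comparisons


def find_with_label(label, target):
    """Like MATCH (n:Label) WHERE n.ID = target: test only labelled nodes."""
    comparisons = 0
    hits = []
    for node in by_label.get(label, []):
        comparisons += 1
        if node["ID"] == target:
            hits.append(node)
    return hits, comparisons


all_hits, all_comparisons = find_all_nodes("UUID-0001")
label_hits, label_comparisons = find_with_label("Science", "UUID-0001")
```

Both lookups find the same node, but the labelled one performs a fifth of the comparisons in this toy data. Under this model the saving comes entirely from the label narrowing the candidate set.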

Even though I got advice from @phil_20686 and @Michael Hunger, I don't think these answers solve my question.
I think there are some tricks when using labels. If there are 10 thousand nodes in a Neo4j DB and they are all of the same type, the query will perform better when a label is added to these nodes.
I hope this post helps some people; please give me some feedback if you find the reason. Thanks.
