Count nodes with a certain property

Count nodes with a certain property - neo4j

I'm working on a dataset describing legislative co-sponsorship. I'm trying to return a table with the name of the bill, the number of legislators who co-sponsored it and then the number of co-sponsors who are Republican and the number who are Democrat. I feel like this should be simple to do but I keep getting syntax errors. Here's what I have so far:
MATCH (b:Bill{Year:"2016"})-[r:COAUTHORED_BY|COSPONSORED_BY|SPONSORED_BY]-(c:Legislators)
WHERE b.name CONTAINS "HB" OR b.name CONTAINS "SB"
RETURN b.name, b.Short_description, COUNT(r) AS TOTAL, COUNT(c.Party = "Republican"), COUNT(c.Party = "Democratic")
ORDER BY COUNT(r) desc
However, in the table this query produces the count of Republican and Democrat sponsors and the count of total sponsors, are all the same. Obviously, the sum of number of Rep and Dem sponsors should equal the total.
What is the correct syntax for this query?

Use the filter:
MATCH (b:Bill{Year:"2016"})
-[r:COAUTHORED_BY|COSPONSORED_BY|SPONSORED_BY]-
(c:Legislators)
WHERE b.name CONTAINS "HB" OR b.name CONTAINS "SB"
WITH b, collect(distinct c) as Legislators
RETURN b.name,
b.Short_description,
SIZE(Legislators) AS TOTAL,
SIZE(FILTER(c in Legislators WHERE c.Party = "Republican")) as Republican,
SIZE(FILTER(c in Legislators WHERE c.Party = "Democratic")) as Democratic
ORDER BY TOTAL desc

Assuming that legislators can ONLY be Republican or Democratic (we'll need to make some adjustments if this isn't the case):
MATCH (b:Bill{Year:"2016"})
WHERE b.name CONTAINS "HB" OR b.name CONTAINS "SB"
WITH b
OPTIONAL MATCH (b)-[:COAUTHORED_BY|COSPONSORED_BY|SPONSORED_BY]-(rep:Legislators)
WHERE rep.Party = "Republican"
OPTIONAL MATCH (b)-[:COAUTHORED_BY|COSPONSORED_BY|SPONSORED_BY]-(dem:Legislators)
WHERE dem.Party = "Democratic"
WITH b, COUNT(DISTINCT rep) as reps, COUNT(DISTINCT dem) as dems
RETURN b.name, b.Short_description, reps + dems AS TOTAL, reps, dems
ORDER BY TOTAL desc

This is a graph model problem, you shouldn't be counting nodes by their properties, if some nodes can have the same property and you want to count in this property, you need to create an intermediate node to set the party:
(b:Bill)-[:SPONSORED_AUTHORED]->(i:Intermediate)-[:TARGET]->(c:Legislators)
and then you create a relation between your intermediate node and the party:
(i:Intermediate)-[:BELONGS_PARTY]->(p:Party{name:"Republican"})
The intermediate node represents the data you actually have in your relationship, but it allows you to create relationships between your operation and a party, making counting easier and way faster.
Keep in mind that this is just an example, without knowing the context I don't know what should be the Intermediate real label and its property, it's just a demo of the concept.
I answered a question using this, feel free to check it (it's a real life example, maybe easier to understand): Neo4j can I make relations between relations?

Related

Neo4J Get only the first relationship per node

I have this graph where the nodes are researchers, and they are related by a relationship named R1, the relationship has a "value" property. How can I get the name of the researchers that are in the relationships with the greatest value? It's like get all the relationships order by r.value DESC but getting only the first relationship per researcher, because I don't want to see on the table duplicated researcher names. By the way, is there a way to get the name of the researchers order by the mean of their relationship "values"? Sorry about the confused topic, I don't speak English very well, thank you very much.
I've been trying things like the Cypher query bellow:
MATCH p=(n)-[r:R1]->(c)
WHERE id(n) < id(c) and r.coauthors = false
return DISTINCT n.name order by n.campus, r.value DESC

Correct me if I am wrong, but you want one result per "n" with the highest value from "r"?
MATCH (n)-[r:R1]->(c)
WHERE r.coauthors = false
WITH n, r ORDER BY r.value DESC
WITH n, head(collect(r)) AS highR
RETURN n.name, highR.value ORDER BY n.campus, highR.value DESC
This will get you all the r's in order and pick the first head(collect(r)) after first doing an ORDER BY. Then you just need to return the values you want. Check out Neo4j Aggregation Functions for some documentation on how aggregation functions work. Good luck!
As an aside, if there is a label that all "n" have, you should add that in your MATCH: MATCH (n:Person) .... it will help speed up your query!

Nodes with relationship to multiple nodes

I want to get the Persons that know everyone in a group of persons which know some specific places.
This:
MATCH (:Place {name:'Breiter Weg'})<-[:knows]-(b:Person)-[:knows]->(:Place {name:'Buchhandel'})
WITH collect(DISTINCT b) as persons
Match (a:Person)
WHERE ALL(b in persons WHERE (a)-[:knows]->(b))
RETURN a
works, but for the second part does a full nodelabelscan, before applying the where clause, which is extremely slow - in a bigger db it takes 8~9 seconds. I also tried this:
MATCH (:Place {name:'Breiter Weg'})<-[:knows]-(b:Person)-[:knows]->(:Place {name:'Buchhandel'})
Match (a:Person)-[:knows]->(b)
RETURN a
This only needs 2ms, however it returns all persons that know any person of group b, instead of those that know everyone.
So my question is: Is there a effective/fast query to get what i want?

We have a knowledge base article for this kind of query that show a few approaches.
One of these is to match to :Persons known by the group, and then count the number of times each of those persons shows up in the results. Provided there aren't multiple :knows relationships between the same two people, if the count is equal to the collection of people from your first match, then that person must know all of the people in the collection.
MATCH (:Place {name:'Breiter Weg'})<-[:knows]-(b:Person)-[:knows]->(:Place {name:'Buchhandel'})
WITH collect(b) as persons
UNWIND persons as b // so we have the entire list of persons along with each person
WITH size(persons) as total, b
MATCH (a:Person)-[:knows]->(b)
WITH total, a, count(a) as knownCount
WHERE total = knownCount
RETURN a

Here is a simpler Cypher query that also compares counts -- the same basic idea used by #InverseFalcon.
MATCH (:Place {name:'Breiter Weg'})<-[:knows]-(b:Person)-[:knows]->(:Place {name:'Buchhandel'}), (a:Person)-[:knows]->(b)
WITH COLLECT({a:a, b:b}) as data, COUNT(DISTINCT b) AS total
UNWIND data AS d
WITH total, d.a AS a, COUNT(d.b) AS bCount
WHERE total = bCount
RETURN a

Neo4j: multiple counts from multiple matches

Given a neo4j schema similar to
(:Person)-[:OWNS]-(:Book)-[:CATEGORIZED_AS]-(:Category)
I'm trying to write a query to get the count of books owned by each person as well as the count of books in each category so that I can calculate the percentage of books in each category for each person.
I've tried queries along the lines of
match (p:Person)-[:OWNS]-(b:Book)-[:CATEGORIZED_AS]-(c:Category)
where person.name in []
with p, b, c
match (p)-[:OWNS]-(b2:Book)-[:CATEGORIZED_AS]-(c2:Category)
with p, b, c, b2
return p.name, b.name, c.name,
count(distinct b) as count_books_in_category,
count(distinct b2) as count_books_total
But the query plan is absolutely horrible when trying to do the second match. I've tried to figure out different ways to write the query so that I can do the two different counts, but haven't figured out anything other than doing two matches. My schema isn't really about people and books. The :CATEGORIZED_AS relationship in my example is actually a few different relationship options, specified as [:option1|option2|option3]. So in my 2nd match I repeat the relationship options so that my total count is constrained by them.
Ideas? This feels similar to Neo4j - apply match to each result of previous match but there didn't seem to be a good answer for that one.

UNWIND is your friend here. First, calculate the total books per person, collecting them as you go.
Then unwind them so you can match which categories they belong to.
Aggregate by category and person, and you should get the number of books in each category, for a person
match (p:Person)-[:OWNS]->(b:Book)
with p,collect(b) as books, count(b) as total
with p,total,books
unwind books as book
match (book)-[:CATEGORIZED_AS]->(c)
return p,c, count(book) as subtotal, total

Neo4j cypher query efficiency and syntax

I am attempting to query an ontology of health represented as an acyclic, directed graph in Neo4j v2.1.5. The database consists of 2 million nodes and 5 million edges/relationships. The following query identifies all nodes subsumed by a disease concept and caused by a particular bacteria or any of the bacteria subtypes as follows:
MATCH p = (a:ObjectConcept{disease}) <-[:ISA*]- (b:ObjectConcept),
q=(c:ObjectConcept{bacteria})<-[:ISA*]-(d:ObjectConcept)
WHERE NOT (b)-->()--(c) AND NOT (b)-->()-->(d)
RETURN distinct b.sctid, b.FSN
This query runs in < 1 second and returns the correct answers. However, adding one additional parameter adds substantial time (20 minutes). Example:
MATCH p = (a:ObjectConcept{disease}) <-[:ISA*]- (b:ObjectConcept),
q=(c:ObjectConcept{bacteria})<-[:ISA*]-(d:ObjectConcept),
t=(e:ObjectConcept{bacteria})<-[:ISA*]-(f:ObjectConcept),
WHERE NOT (b)-->()--(c)
AND NOT (b)-->()-->(d)
AND NOT (b)-->()-->(e)
AND NOT (b)-->()-->(f)
RETURN distinct b.sctid, b.FSN
I am new to cypher coding, but I have to imagine there is a better way to write this query to be more efficient. How would Collections improve this?
Thanks

I already answered that on the google group:
Hi Scott,
I presume you created indexes or constraints for :ObjectConcept(name) ?
I am working with an acyclic, directed graph (an ontology) that models
human health and am needing to identify certain diseases (example:
Pneumonia) that are infectious but NOT caused by certain bacteria
(staph or streptococcus). All concepts are Nodes defined as
ObjectConcepts. ObjectConcepts are connected by relationships such as
[ISA], [Pathological_process], [Causative_agent], etc.
The query requires:
a) Identification of all concepts subsumed by the concept Pneumonia as follows:
MATCH p = (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)
this already returns a number of paths, potentially millions, can you check that with
MATCH p = (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept) return count(*)
b) Identification of all concepts subsumed by Genus Staph and Genus Strep (including the concept Genus Staph and Genus Strep) as follows. Note:
with b MATCH (b) q = (c:ObjectConcept{Strep})<-[:ISA*]-(d:ObjectConcept), h = (e:ObjectConcept{Staph})<-[:ISA*]-(f:ObjectConcept)
this is then the cross product of the paths from "p", "q" and "h", e.g. if all 3 of them return 1000 paths, you're at 1bn paths !!
c) Identify all nodes(p) that do not have a causative agent of Strep (i.e., nodes(q)) or Staph (nodes(h)) as follows:
with b,c,d,e,f MATCH (b),(c),(d),(e),(f) WHERE (b)--()-->(c) OR (b)-->()-->(d) OR (b)-->()-->(e) OR (b)-->()-->(f) RETURN distinct b.Name;
you don't need the WITH or even the MATCH (b),(c),(d),(e),(f)
what connections are there between b and the other nodes ? do you have concrete ones? for the first there is also missing one direction.
the where clause can be a problem, in general you want to show that perhaps this query is better reproduced by a UNION of simpler matches
e.g
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(c:ObjectConcept{name:Strep}) RETURN b.name
UNION
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(e:ObjectConcept{name:Staph}) RETURN b.name
UNION
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(d:ObjectConcept)-[:ISA*]->(c:ObjectConcept{name:Strep}) return b.name
UNION
MATCH (a:ObjectConcept{Pneumonia}) <-[:ISA*]- (b:ObjectConcept)-->()-->(d:ObjectConcept)-[:ISA*]->(c:ObjectConcept{name:Staph}) return b.name
another option would be to utilize the shortestPath() function to find one or all shortest path(s) between Pneumonia and the bacteria with certain rel-types and direction.
Perhaps you can share the dataset and the expected result.

The query was successfully accomplished using UNION functions as follows:
MATCH p = (a:ObjectConcept{sctid:233604007}) <-[:ISA*]- (b:ObjectConcept),
q = (c:ObjectConcept{sctid:58800005})<-[:ISA*]-(d:ObjectConcept)
WHERE NOT (b)-->()--(c) AND NOT (b)-->()-->(d)
RETURN distinct b
UNION
MATCH p = (a:ObjectConcept{sctid:233604007}) <-[:ISA*]- (b:ObjectConcept),
t = (e:ObjectConcept{sctid:65119002}) <-[:ISA*]- (f:ObjectConcept)
WHERE NOT (b)-->()-->(e) AND NOT (b)-->()-->(f)
RETURN distinct b
The query runs in sub 20 seconds vs. 20 minutes by reducing the cardinality of the objects being queried.

Select nodes that has all relationships in Neo4j

Suppose I have two kinds of nodes, Person and Competency. They are related by a KNOWS relationship. For example:
(:Person {id: 'thiago'})-[:KNOWS]->(:Competency {id: 'neo4j'})
How do I query this schema to find out all Person that knows all nodes of a set of Competency?
Suppose that I need to find every Person that knows "java" and "haskell" and I'm only interested in the nodes that knows all of the listed Competency nodes.
I've tried this query:
match (p:Person)-[:KNOWS]->(c:Competency) where c.id in ['java','haskell'] return p.id;
But I get back a list of all Person that knows either "java" or "haskell" and duplicated entries for those who knows both.
Adding a count(c) at the end of the query eliminates the duplicates:
match (p:Person)-[:KNOWS]->(c:Competency) where c.id in ['java','haskell'] return p.id, count(c);
Then, in this particular case, I can iterate the result and filter out results that the count is less than two to get the nodes I want.
I've found out that I could do it appending consecutive match clauses to keep filtering the nodes to get the result I want, in this case:
match (p:Person)-[:KNOWS]->(:Competency {id:'haskell'})
match (p)-[:KNOWS]->(:Competency {id:'java'})
return p.id;
Is this the only way to express this query? I mean, I need to create a query by concatenating strings? I'm looking for a solution to a fixed query with parameters.

with ['java','haskell'] as skills
match (p:Person)-[:KNOWS]->(c:Competency)
where c.id in skills
with p.id, count(*) as c1 ,size(skills) as c2
where c1 = c2
return p.id

One thing you can do, is to count the number of all skills, then find the users that have the number of skill relationships equals to the skills count :
MATCH (n:Skill) WITH count(n) as skillMax
MATCH (u:Person)-[:HAS]->(s:Skill)
WITH u, count(s) as skillsCount, skillMax
WHERE skillsCount = skillMax
RETURN u, skillsCount
Chris

Untested, but this might do the trick:
match (p:Person)-[:KNOWS]->(c:Competency)
with p, collect(c.id) as cs
where all(x in ['java', 'haskell'] where x in cs)
return p.id;

How about this...
WITH ['java','haskell'] AS comp_col
MATCH (p:Person)-[:KNOWS]->(c:Competency)
WHERE c.name in comp_col
WITH comp_col
, p
, count(*) AS total
WHERE total = length(comp_col)
RETURN p.name, total
Put the competencies you want in a collection.
Match all the people that have either of those competencies
Get the count of compentencies by person where they have the same number as in the competency collection from the start
I think this will work for what you need, but if you are building these queries programatically the best performance you get might be with successive match clauses. Especially if you knew which competencies were most/least common when building your queries, you could order the matches such that the least common were first and the most common were last. I think that would chunk down to your desired persons the fastest.
It would be interesting to see what the plan analyzer in the sheel says about the different approaches.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Count nodes with a certain property - neo4j

Related

Neo4J Get only the first relationship per node

Nodes with relationship to multiple nodes

Neo4j: multiple counts from multiple matches

Neo4j cypher query efficiency and syntax

Select nodes that has all relationships in Neo4j

Categories

Resources