Neo4j: returning the node(s) with the most appearances

I need a query that returns the club or clubs with the largest number of players who do not represent the country the club is from.
My query works, but I want to filter the result so that it contains ONLY the clubs with the maximum size.
Right now the largest size is 4, and 4 clubs each have 4 such players.
The only thing that came to mind was adding LIMIT 1 at the end, but that cuts out the three other clubs that also satisfy the predicate.
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH c,collect(p.name) as list_players,n.country as country,size(collect(p.name)) as size
RETURN c,list_players,country,size ORDER BY size DESC LIMIT 1
edit:
I managed to do something like this, don't know if it's optimal, but it is working:
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH c,collect(p.name) as list_players,n.country as country,size(collect(p.name)) as size
WITH c,list_players,country,size ORDER BY size DESC LIMIT 1
WITH size
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH size,c,collect(p.name) as list_players,n.country as country,size(collect(p.name)) as size2 WHERE size2 = size
RETURN c,list_players,country,size

If you install APOC Procedures, there is an aggregation function you can use to get the items associated with a maximum value, and this works even when multiple items are tied for that value: apoc.agg.maxItems()
The catch is that all the club-specific data has to be encapsulated in the item itself, so you'll need to put it into a map, use the map as the item, and use the size of the player collection as the value.
Also, your aggregation isn't quite correct. You're collecting player names, but you have the country of the national team as part of the grouping key (when you aggregate, all non-aggregation terms form the grouping key), and that probably isn't what you want. Maybe you wanted the country of the club instead?
Try working from this:
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH c,collect(p) as list_players
WITH apoc.agg.maxItems({club:c, players:list_players}, size(list_players)) as maxResults
UNWIND maxResults.items as result
WITH result.club as c, [player IN result.players | player.name] as list_players, maxResults.value as size
RETURN c,list_players,size
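If installing APOC is not an option, the same "keep every club tied for the maximum" idea can be sketched in plain Cypher: collect the per-club rows, take the maximum size across them, then unwind and keep only the rows that match it. This is a minimal sketch using the labels and properties from the question; the names rows, row and maxSize are just illustrative.

```cypher
// One row per club, with the names of its players who represent another country
MATCH (c:Club)<-[:PLAYS_FOR]-(p:Player)-[:REPRESENTS]->(n:NationalTeam)
WHERE c.country <> n.country
WITH c, collect(p.name) AS list_players
// Gather all rows into one list and compute the largest player count
WITH collect({club: c, players: list_players}) AS rows,
     max(size(list_players)) AS maxSize
// Expand the list again and keep only the clubs tied for the maximum
UNWIND rows AS row
WITH row, maxSize
WHERE size(row.players) = maxSize
RETURN row.club AS c, row.players AS list_players, maxSize AS size
```

Unlike the double-MATCH workaround in the question, this traverses the graph only once, at the cost of holding all per-club rows in memory for a moment.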

Related

Co-occurrence analysis in Neo4j database

Let's say I have a database with two types of nodes, Candyjars and Candies. Every Candyjar (Candyjar1, Candyjar2...) has a different number of candies of different types: CandyRed, CandyGreen, etc.
Now let's say the end goal is to find the probability that the various types of candies occur together, and the covariance among them. I then want relationships between each pair of candy types carrying the probability of co-occurrence and the covariance. Let's call these relationships OCCURS_WITH and COVARIES, so that Candytype1 -[OCCURS_WITH]->Candytype2 and Candytype1 -[COVARIES]->Candytype2
I'd make a database with Candytype and Candyjar nodes and a relationship (cj:Candyjar)-[r:CONTAINS]->(ct:Candytype), where r can have an attribute recording how many candies of that type the jar contains.
My problem is that I don't understand how, in Cypher, I can write a query that assigns the OCCURS_WITH relationship efficiently. Would I have to iterate over every pair of candy types, dividing the number of candy jars in which the pair co-occurs by the total number of candy jars? Is there a way to do it for all possible pairs at once?
When I try to do:
MATCH (ct1:Candytype)<-[r1:CONTAINS]-(cj:Candyjar)-[r2:CONTAINS]->(ct2:Candytype)
WHERE ct1<>ct2 AND ct1.name="CandyRed" AND ct2.name="CandyBlue"
RETURN ct1,r1,count(r1),cj,ct2,r2,count(r2)
LIMIT 5
I cannot get the count of the relationships of the co-occurring candies that I would need to express the probability of co-occurrence.
Would I have to use something like python to do the calculations rather than try to make a statement in Cypher?
To get the count of how many times CandyRed and CandyBlue co-occur, you can use the following Cypher statement:
MATCH (ct1:Candytype)<-[:CONTAINS]-(:Candyjar)-[:CONTAINS]->(ct2:Candytype)
WHERE ct1.name="CandyRed" AND ct2.name="CandyBlue"
RETURN ct1,ct2, count(*) AS coOccur
LIMIT 5
If you want a query that will compare all the candy types, you can use:
MATCH (ct1:Candytype)<-[:CONTAINS]-(:Candyjar)-[:CONTAINS]->(ct2:Candytype)
WHERE id(ct1) < id(ct2)
RETURN ct1,ct2, count(*) AS coOccur
LIMIT 5
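Building on the all-pairs query above, the co-occurrence probability can be computed and stored in the same statement. The sketch below assumes that "probability" means the number of jars containing both types divided by the total number of jars, as the question describes; the property names jars and probability are hypothetical, and covariance is not covered here.

```cypher
// Total number of jars, used as the denominator
MATCH (cj:Candyjar)
WITH count(cj) AS totalJars
// Every unordered pair of candy types that share at least one jar
MATCH (ct1:Candytype)<-[:CONTAINS]-(jar:Candyjar)-[:CONTAINS]->(ct2:Candytype)
WHERE id(ct1) < id(ct2)
// count(DISTINCT jar) guards against duplicate CONTAINS relationships
WITH ct1, ct2, count(DISTINCT jar) AS coOccur, totalJars
MERGE (ct1)-[o:OCCURS_WITH]->(ct2)
SET o.jars = coOccur,
    o.probability = toFloat(coOccur) / totalJars
```

Because of the id(ct1) < id(ct2) filter, each pair gets a single relationship, so later queries should match OCCURS_WITH without a direction.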

Correct order of operations in neo4j - LOAD, MERGE, MATCH, WITH, SET

I am loading simple CSV data into Neo4j. The data looks like this:
uniqueId     compound     value  category
ACT12_M_609  mesulfen     21     carbon
ACT12_M_609  MNAF         23     carbon
ACT12_M_609  nifluridide  20     suphate
ACT12_M_609  sulfur       23     carbon
I am loading the data from the URL using the following query -
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE( t: Transaction { transactionId: row.uniqueId })
MERGE(c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category= row.category
ON CREATE SET r.price =row.value
Next I do the aggregation to count total orders for a compound and create property for a node in the following way -
MATCH (c:Compound)<-[:CONTAINS]-(t:Transaction)
WITH c, count(DISTINCT t.transactionId) AS ord
SET c.orders = ord
So far so good. I can accomplish what I want, but I have the following 2 questions:
1. How can I create the orders property for the compound node in the first step itself? That is, I would like to perform the aggregation straight away while loading the data.
2. For a compound node I am also setting a category property. Theoretically, it could also be modelled as category -contains-> compound by creating a Category node. What advantage would that give me? I can already execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
1. I don't think that's possible. LOAD CSV goes over one row at a time, so at row 1 it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those, and then use them to create the real nodes, but that would be far more complicated (see "Virtual Nodes/Rels" in the APOC documentation).
2. That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often run queries where the category is a criterion (e.g. MATCH (c: Category {category_id: 12})-[r]-(:Compound) ), it might be more performant to model the category as its own node.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.
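To make the second point concrete, here is a minimal sketch of the load with the category modelled as a node. The Category label with a name property and the reuse of CONTAINS for category-to-compound are assumptions taken from the question's wording ("category -contains-> compound"); "url" is still the placeholder from the original query.

```cypher
LOAD CSV WITH HEADERS FROM "url" AS row
MERGE (t:Transaction {transactionId: row.uniqueId})
MERGE (c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET r.price = row.value
// Category becomes a node instead of a property on Compound
MERGE (cat:Category {name: row.category})
MERGE (cat)-[:CONTAINS]->(c);

// Filtering by category is now a traversal from a single Category node
MATCH (:Category {name: "carbon"})-[:CONTAINS]->(c:Compound)
RETURN c.name;
```

Whether this beats an indexed category property depends on the data and the queries, so it is worth profiling both variants.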

Neo4j poor performance on ORDER BY

I have query like this
MATCH (p:Person)-[s:KNOWS]->(t:Person) WHERE s.state = "blocked"
WITH DISTINCT (t) AS user
SKIP 0
LIMIT 10
MATCH (user)<-[r:KNOWS { state: "blocked" }]-(p:Person)
RETURN user.username, SIZE(COLLECT(p.username)) as count
The first problem is that the query gets slower as SKIP grows, for example with SKIP 100. Any idea why?
The second problem is that when I add an ORDER BY, for example ORDER BY p.createdAt (an indexed date field), the query always times out.
You might have better performance with a tweak to your initial MATCH:
MATCH (t:Person)
WHERE ()-[:KNOWS {state:"blocked"}]->(t)
WITH t AS user // no longer need DISTINCT here
...
For better performance you might consider creating a fulltext schema index on :KNOWS relationships by their state property, and use the fulltext index query procedure to do this initial lookup (this is assuming :KNOWS relationships always connect two :Person nodes).
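A rough sketch of that fulltext approach follows. The index name knowsState is hypothetical, and the CREATE FULLTEXT INDEX syntax assumes a reasonably recent Neo4j version (older versions create fulltext indexes through a procedure instead), so check it against the version you run.

```cypher
// One-time setup: fulltext index over the state property of :KNOWS relationships
CREATE FULLTEXT INDEX knowsState FOR ()-[r:KNOWS]-() ON EACH [r.state];

// Look up the blocked relationships through the index, then page over the target users
CALL db.index.fulltext.queryRelationships("knowsState", "blocked") YIELD relationship
WITH DISTINCT endNode(relationship) AS user
SKIP 0
LIMIT 10
MATCH (user)<-[:KNOWS {state: "blocked"}]-(p:Person)
RETURN user.username, count(p) AS count;
```

Running both variants under PROFILE should show whether the index lookup actually reduces the number of database hits for your data.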

Neo4j variable-length pattern matching tuning

Query:
PROFILE
MATCH(node:Symptom) WHERE node.symptom =~ '.*adult male.*|.*151.*'
WITH node
MATCH (node)-[*1..2]-(result:Disease)
RETURN result
Problems:
There are over 40 thousand "Symptom" nodes in the database, and the query is very slow because of the "[*1..2]" part.
It takes only 4 seconds when the length is 1, i.e. "[*1]", but about 30 seconds when the length is 2, i.e. "[*1..2]".
Is there any way to tune this query?
First, your query uses the regex operator (=~), which can't use indexes. You should use the CONTAINS operator instead:
MATCH (node:Symptom)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
RETURN node
And you can create an index: CREATE INDEX ON :Symptom(symptom)
For the second part of your query, as it stands, there is nothing to do: the cost comes from the complexity of what you are asking for.
So, to get better performance, you should think about:
putting the relationship type on the pattern to reduce the number of returned paths: (node)-[:MY_REL_TYPE*1..2]-(result:Disease)
putting the relationship direction on the pattern to reduce the number of returned paths: (node)-[:MY_REL_TYPE*1..2]->(result:Disease)
finding another way to reduce this complexity (filtering on a relationship property, reviewing your model, etc.)
For your information, you can write your query in one step (i.e. without the WITH, though in your case performance should be the same):
MATCH (node:Symptom)-[*1..2]-(result:Disease)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
RETURN result
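Putting the suggestions together, a tuned version might look like the sketch below. MY_REL_TYPE is the placeholder from the list above and the outgoing direction is an assumption, so adapt both to the actual model; the DISTINCT is an addition that avoids returning the same disease once per path.

```cypher
MATCH (node:Symptom)-[:MY_REL_TYPE*1..2]->(result:Disease)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
RETURN DISTINCT result
```

If several relationship types are relevant, they can be listed as [:TYPE_A|TYPE_B*1..2].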

Can Neo4j be effectively used to show a collection of nodes in a sortable and filterable table?

I realise this may not be ideal usage, but apart from all the graphy goodness of Neo4j, I'd like to show a collection of nodes, say, People, in a tabular format that has indexed properties for sorting and filtering.
I'm guessing the Type of a node can be stored as a Link, say Bob -> type -> Person, which would allow us to retrieve all People.
Are the following possible to do efficiently (indexed?) and in a scalable manner?
Retrieve all People nodes and display all of their names, ages, cities of birth, etc. (NOTE: some of this data will be properties, some Links to other nodes, which could be denormalised as properties for the sake of table display and simplicity)
Show me all People sorted by Age
Show me all People with Age < 30
Also a quick how to do the above (or a link to some place in the docs describing how) would be lovely
Thanks very much!
Oh and if the above isn't a good idea, please suggest a storage solution which allows both graph-like retrieval and relational-like retrieval
If you want to operate on these person nodes, you can put them into an index (the default is Lucene) and then retrieve and sort the nodes using Lucene (see for instance "How do I sort Lucene results by field value using a HitCollector?" on how to do a custom sort in Java). This will get you, for instance, People sorted by Age. The code in Neo4j could look like
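// Assumes the legacy (pre-schema) Neo4j index API and an existing GraphDatabaseService named neo4j
Transaction tx = neo4j.beginTx();
IndexManager idxManager = neo4j.index();
Index<Node> personIndex = idxManager.forNodes("persons");
// Index both nodes by their name property
personIndex.add(meNode, "name", meNode.getProperty("name"));
personIndex.add(youNode, "name", youNode.getProperty("name"));
tx.success();
tx.finish();
// Prepare a custom Lucene query context with the Neo4j API: all names, sorted in reverse order
QueryContext query = new QueryContext("name:*").sort(new Sort(new SortField("name", SortField.STRING, true)));
IndexHits<Node> results = personIndex.query(query);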
For combining index lookups and graph traversals, Cypher is a good choice, e.g.
START people = node:people_index(name="E*") MATCH people-[r]->() return people.name, r.age order by r.age asc
in order to return data on both the node and the relationships.
Sure, that's easily possible with the Neo4j query language Cypher.
For example:
start cat=node:Types(name='Person')
match cat<-[:IS_A]-person-[born:BORN]->city
where person.age > 30
return person.name, person.age, born.date, city.name
order by person.age asc
limit 10
You can experiment with it in our cypher console.
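The answers above use the older START / index syntax. In Neo4j versions that support node labels and schema indexes, the two table-style requests from the question could be sketched as follows. This assumes a Person label with name and age properties; the index name person_age is hypothetical, and the exact CREATE INDEX syntax varies between versions.

```cypher
// One-time setup: index the property used for filtering and sorting
CREATE INDEX person_age FOR (p:Person) ON (p.age);

// All people sorted by age, one page at a time
MATCH (p:Person)
RETURN p.name, p.age
ORDER BY p.age ASC
SKIP 0
LIMIT 10;

// All people younger than 30
MATCH (p:Person)
WHERE p.age < 30
RETURN p.name, p.age
ORDER BY p.age ASC;
```

With labels, the type no longer needs to be modelled as a separate node linked through IS_A.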
