Neo4j - Return a limited number of relationships/nodes depending on a calculation

How can I return a limited number of nodes/relationships depending on a calculation?
I prepared an example:
Imagine we have 4 users (nodes) who each collect (relationship) 40 kg of apples, and each relationship has a created timestamp. We also have a basket (node) that can hold 100 kg. How can I return only the 3 oldest relationships, since those 3 are enough to fill the basket?
In other words:
The collected kilos of the 3 oldest relationships sum to 120 kg, which exceeds the basket size of 100 kg. Taking only the 2 oldest relationships would give 80 kg, which is too little for the basket. Taking all 4 relationships would give 160 kg, which is far too much.
The background of my question is that I want to reduce the size of the query result. If a query returned all 4 relationships and the 4th were always ignored, it would be better not to return it in the first place.
Thank you
The schema of the example looks like this:
In Neo4j, the nodes and relationships can be created like this:
create
(_0:`Basket` {`size`:100}),
(_1:`User` {`name`:"Franc"}),
(_6:`User` {`name`:"Peter"}),
(_34:`User` {`name`:"Betty"}),
(_35:`User` {`name`:"Rita"}),
(_54:`Fruit` ),
(_1)-[:`COLLECTS` {`created`:"20221206212715",`kilos`:40,`type`:"Apples"}]->(_54),
(_6)-[:`COLLECTS` {`created`:"20221206212417",`kilos`:40,`type`:"Apples"}]->(_54),
(_34)-[:`COLLECTS` {`created`:"20221206212547",`kilos`:40,`type`:"Apples"}]->(_54),
(_35)-[:`COLLECTS` {`created`:"20221206212815",`kilos`:40,`type`:"Apples"}]->(_54)
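The selection rule described in the question (take the oldest relationships until their running total reaches the basket size) can be modelled in plain Python. This is only a sketch of the logic, not a Cypher query; the tuples mirror the sample data above, and the function name `oldest_until_full` is made up for illustration.

```python
from itertools import accumulate

# Sample data mirroring the COLLECTS relationships in the question:
# (user name, created timestamp, kilos).
collects = [
    ("Franc", "20221206212715", 40),
    ("Peter", "20221206212417", 40),
    ("Betty", "20221206212547", 40),
    ("Rita", "20221206212815", 40),
]
basket_size = 100


def oldest_until_full(rows, capacity):
    """Return the oldest rows whose running kilo total first reaches capacity."""
    # Fixed-width timestamps sort correctly as strings.
    ordered = sorted(rows, key=lambda r: r[1])
    totals = accumulate(kilos for _, _, kilos in ordered)
    selected = []
    for row, running_total in zip(ordered, totals):
        selected.append(row)
        if running_total >= capacity:
            break  # basket is full; later rows are not needed
    return selected


chosen = oldest_until_full(collects, basket_size)
print([name for name, _, _ in chosen])
```

With the sample data this keeps Peter, Betty and Franc (120 kg) and drops Rita, which is the result the question asks for.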

Related

Co-occurrence analysis in Neo4j database

Let's say I have a database with nodes of two types, Candyjars and Candies. Every Candyjar (Candyjar1, Candyjar2, ...) holds a different number of candies of different types: CandyRed, CandyGreen, etc.
Now let's say the end goal is to find the probability that the various types of candies occur together, and the covariance among them. Then I want relationships between each pair of candy types that carry the probability of co-occurrence and the covariance. Let's call these relationships OCCURS_WITH and COVARIES, so that Candytype1 -[OCCURS_WITH]->Candytype2 and Candytype1 -[COVARIES]->Candytype2.
I'd make a database with Candytypes and Candyjars as nodes and create a relationship (cj:Candyjar)-[r:CONTAINS]->(ct:Candytype), where r can have a property recording how many candies of that type are contained in the jar.
Now my problem is that I don't understand how, in Cypher, I can write a query that assigns the OCCURS_WITH relationship in an optimal manner. Would I have to iterate over every pair of candies, dividing the number of candyjars in which the pair co-occurs by the total number of candyjars? Is there a way to do it for all possible pairs at once?
When I try to do:
MATCH (ct1:Candytype)<-[r1:CONTAINS]-(cj:Candyjar)-[r2:CONTAINS]->(ct2:Candytype)
WHERE ct1<>ct2 AND ct1.name="CandyRed" AND ct2.name="CandyBlue"
RETURN ct1,r1,count(r1),cj1,ct2,r2,count(r2)
LIMIT 5
I cannot get the count of the relationships of the co-occurring candies that I would need to express the probability of co-occurrence.
Would I have to use something like Python to do the calculations rather than trying to write a statement in Cypher?
To get the count of how many times CandyRed and CandyBlue co-occur, you can use the following Cypher statement:
MATCH (ct1:Candytype)<-[:CONTAINS]-(:Candyjar)-[:CONTAINS]->(ct2:Candytype)
WHERE ct1.name="CandyRed" AND ct2.name="CandyBlue"
RETURN ct1,ct2, count(*) AS coOccur
LIMIT 5
If you want a query that will compare all the candy types, you can use:
MATCH (ct1:Candytype)<-[:CONTAINS]-(:Candyjar)-[:CONTAINS]->(ct2:Candytype)
WHERE id(ct1) < id(ct2)
RETURN ct1,ct2, count(*) AS coOccur
LIMIT 5
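To turn such counts into the co-occurrence probability the question describes (jars containing both types divided by all jars), the arithmetic can be sketched in plain Python. The jar contents below are hypothetical, and `co_occurrence_probabilities` is an illustrative name, not part of any library.

```python
from itertools import combinations
from collections import Counter

# Hypothetical jars: each jar maps to the set of candy types it contains.
jars = {
    "Candyjar1": {"CandyRed", "CandyBlue"},
    "Candyjar2": {"CandyRed", "CandyGreen"},
    "Candyjar3": {"CandyRed", "CandyBlue", "CandyGreen"},
    "Candyjar4": {"CandyGreen"},
}


def co_occurrence_probabilities(jar_contents):
    """Map each unordered pair of candy types to (jars with both) / (all jars)."""
    pair_counts = Counter()
    for types in jar_contents.values():
        # sorted() gives each pair one canonical order, much like the
        # id(ct1) < id(ct2) filter avoids counting a pair twice.
        for pair in combinations(sorted(types), 2):
            pair_counts[pair] += 1
    total = len(jar_contents)
    return {pair: count / total for pair, count in pair_counts.items()}


probs = co_occurrence_probabilities(jars)
print(probs[("CandyBlue", "CandyRed")])
```

Here CandyBlue and CandyRed share 2 of the 4 jars, so their co-occurrence probability is 0.5.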

Neo4j - typical query - returning a node with the most appearances

I have to write a query that returns the club or clubs with the largest number of players who do not represent the country the club is from.
My query works fine, but I want to filter the result so that it contains ONLY the clubs with the largest size.
For now the largest size is 4, and there are 4 clubs with 4 such players each, all of which should be returned.
The only thing that came to mind was to use LIMIT 1 at the end, but that cuts out the three other clubs that also satisfy the predicate.
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH c,collect(p.name) as list_players,n.country as country,size(collect(p.name)) as size
RETURN c,list_players,country,size ORDER BY size DESC LIMIT 1
edit:
I managed to do something like this, don't know if it's optimal, but it is working:
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH c,collect(p.name) as list_players,n.country as country,size(collect(p.name)) as size
WITH c,list_players,country,size ORDER BY size DESC LIMIT 1
WITH size
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH size,c,collect(p.name) as list_players,n.country as country,size(collect(p.name)) as size2 WHERE size2 = size
RETURN c,list_players,country,size
If you install APOC Procedures, there is an aggregation function you can use to get the items associated with a maximum value, and this works even when multiple items are tied for that value: apoc.agg.maxItems()
The trouble now is that all the club-specific data needs to be encapsulated into the item itself, so you'll need to put it in a map and use the map as the item, with the size of the player collection as the value.
Also, your aggregation isn't quite correct. You're collecting player names, but you have the country of the player as part of the grouping key (when you aggregate, all non-aggregation terms form the grouping key), and that likely isn't what you want. Maybe you wanted the country of the club instead?
Try working from this:
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH c,collect(p) as list_players
WITH apoc.agg.maxItems({club:c, players:list_players}, size(list_players)) as maxResults
UNWIND maxResults.items as result
WITH result.club as c, [player IN result.players | player.name] as list_players, maxResults.value as size
RETURN c,list_players,size
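The idea of keeping every item tied for the maximum, rather than cutting off with LIMIT 1, can be illustrated in plain Python. This is a conceptual sketch only and makes no claim about how APOC implements it; the club data and the helper `max_items` are hypothetical.

```python
def max_items(pairs):
    """Return (max_value, items) keeping every item tied for the largest value."""
    best_value = None
    best_items = []
    for item, value in pairs:
        if best_value is None or value > best_value:
            best_value, best_items = value, [item]  # new maximum, restart the list
        elif value == best_value:
            best_items.append(item)  # tie with the current maximum
    return best_value, best_items


# Hypothetical clubs with the players who represent a different country.
clubs = {
    "Club A": ["p1", "p2", "p3", "p4"],
    "Club B": ["p5", "p6"],
    "Club C": ["p7", "p8", "p9", "p10"],
}
size, winners = max_items((club, len(players)) for club, players in clubs.items())
print(size, winners)
```

Both Club A and Club C are returned with size 4, whereas sorting and taking the first row would have kept only one of them.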

Correct order of operations in neo4j - LOAD, MERGE, MATCH, WITH, SET

I am loading simple CSV data into Neo4j. The data looks as follows:
uniqueId     compound     value  category
ACT12_M_609  mesulfen     21     carbon
ACT12_M_609  MNAF         23     carbon
ACT12_M_609  nifluridide  20     suphate
ACT12_M_609  sulfur       23     carbon
I am loading the data from the URL using the following query -
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE( t: Transaction { transactionId: row.uniqueId })
MERGE(c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category= row.category
ON CREATE SET r.price =row.value
Next I aggregate to count the total orders for each compound and store the result as a property on the node:
MATCH (c:Compound) <-[:CONTAINS]- (t:Transaction)
with c, count( distinct t.transactionId) as ord
set c.orders = ord
So far so good. I can accomplish what I want, but I have the following 2 questions:
1. How can I create the orders property for the compound node in the first step itself, i.e. perform the aggregation straight away while loading the data?
2. For a compound node I am also setting the category property. In theory, it could also be modelled as category -contains-> compound by creating a Category node. But what advantage would that give me? I can already run my queries and get the expected output without creating this additional node.
Thank you for your answer.
On the first question: I don't think that's possible. LOAD CSV processes one row at a time, so at row 1 it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those, and then use them to create the real nodes, but that would be far more complicated (see Virtual Nodes/Rels).
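Why the count cannot be final while rows are still arriving can be seen in a small plain-Python model of the aggregation. The rows follow the shape of the CSV in the question, but the second transaction id is invented for illustration, as is the function name `orders_per_compound`.

```python
from collections import defaultdict

# Rows in the shape of the CSV above; the second transaction id is made up
# for illustration.
rows = [
    {"uniqueId": "ACT12_M_609", "compound": "mesulfen"},
    {"uniqueId": "ACT12_M_609", "compound": "sulfur"},
    {"uniqueId": "ACT12_M_610", "compound": "mesulfen"},
]


def orders_per_compound(csv_rows):
    """Count distinct transaction ids per compound over the rows given."""
    seen = defaultdict(set)
    for row in csv_rows:
        seen[row["compound"]].add(row["uniqueId"])
    return {compound: len(ids) for compound, ids in seen.items()}


partial = orders_per_compound(rows[:1])  # what is known after the first row
final = orders_per_compound(rows)        # what is known after the last row
print(partial, final)
```

After the first row mesulfen appears to have 1 order, and only after the last row does the count settle at 2, which is why a separate aggregation pass after the load is the simpler option.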
On the second question: that depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often run queries where the category is a criterion (e.g. MATCH (c: Category {category_id: 12})-[r]-(:Compound) ), it might be more performant to model the category as its own node.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.

Neo4j Cypher performance on simple match

I have a very simple Cypher query that performs poorly.
I have approx. 2 million users and 60 book categories, with around 28 million relationships from users to categories.
When I run this query:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN distinct(bc.id);
It returns 8.5k rows in 2 to 2.5 minutes (on the first run).
And when I run this query:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN u.id, u.email, read.timestamp;
It returns 55k rows in 3 to 6 minutes (on the first run).
I already have indexes on User id and email, but I still don't think this performance is acceptable. Any idea how I can improve this?
First of all, you can profile your query to find out what happens under the hood.
Currently it looks like the query scans all nodes in the database to complete.
Reasons:
Neo4j supports indexes only for the '=' operation (or 'IN').
To complete the query, it traverses all nodes one by one, checking whether each has a valid timestamp.
There is no straightforward way to deal with this problem.
You should look into creating a proper graph structure to handle time-specific queries more efficiently. There are several ways to represent time in graph databases.
You can take a look at the graphaware/neo4j-timetree library.
Can you explain your model a bit?
Where are the books and the "reading" event in it?
As far as I understand, all you want to know is which book categories have been read recently (in the last month)?
You could create a second type of relationship, RECENTLY_READ, which expires (is deleted by a batch job) once it is older than 30 days. (That can be two simple Cypher statements which create and delete those relationships.)
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[read:READ]->(b:BookCategory)
WHERE read.timestamp >= timestamp() - month
MERGE (a)-[rr:RECENTLY_READ]->(b)
WITH rr, read
WHERE coalesce(rr.timestamp,0) < read.timestamp
SET rr.timestamp = read.timestamp;
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[rr:RECENTLY_READ]->(b:BookCategory)
WHERE rr.timestamp < timestamp() - month
DELETE rr;
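The two-step batch job (refresh recent entries, then expire old ones) can be modelled in plain Python to check the cutoff arithmetic. This is a sketch of the logic only, with a dictionary standing in for the RECENTLY_READ relationships; the function name, user names and category names are hypothetical.

```python
MONTH_MS = 1000 * 60 * 60 * 24 * 30  # 30 days in milliseconds, as in the queries


def refresh_recently_read(reads, recent, now_ms):
    """Upsert recent reads into `recent`, then drop entries older than 30 days.

    reads:  iterable of (user, category, timestamp_ms)
    recent: dict mapping (user, category) -> newest timestamp_ms, updated in place
    """
    cutoff = now_ms - MONTH_MS
    for user, category, ts in reads:
        if ts >= cutoff and recent.get((user, category), 0) < ts:
            recent[(user, category)] = ts  # create or refresh RECENTLY_READ
    for key in [k for k, ts in recent.items() if ts < cutoff]:
        del recent[key]  # expire stale RECENTLY_READ entries
    return recent


now = 10 * MONTH_MS  # hypothetical "current time"
recent = {("u1", "SciFi"): now - 2 * MONTH_MS}  # stale entry from an earlier run
reads = [("u1", "Crime", now - 1000), ("u2", "SciFi", now - 3 * MONTH_MS)]
print(refresh_recently_read(reads, recent, now))
```

Only the read from the last 30 days survives; the stale entry is removed and the three-month-old read is never added.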
There is another way to achieve exactly what you want here, but unfortunately it's not possible in Cypher.
With a relationship index on timestamp on your READ relationship, you can run a Lucene NumericRangeQuery in Neo4j's Java API.
But I wouldn't really recommend going down this route.

Neo4j Facet Fields like Solr

I'm using Neo4j in a community e-commerce built in PHP and using the REST interface.
I need to get all categories related to the search results, like Amazon does. This feature is available in other engines like Solr (another implementation of Lucene) as Faceted Search.
How can I do a Faceted Search in Neo4j? Or what's the best way (performance-wise) to recreate this feature?
All modules required for this feature are excluded from the core package of Neo4j. I want to know whether someone has tried to do something like this without traversing all nodes in the graph, grabbing some properties, and doing a groupCount of their values. With 200k nodes, the traversal took 10 seconds just to get the categories.
This is my Gremlin approach.
(new Neo4jVertexSequence(
g.getRawGraph().index().forNodes('products').query(
new org.neo4j.index.lucene.QueryContext('category:?')
), g
))._().groupBy{it.category}.cap.next();
Results in 90 rows and took 54 seconds.
Books = 12002
Movies_Music_Games = 19233
Electronics_Computers = 60540
Home_Garden_Tools = 9123
Grocery_Health_Beauty = 15643
Toys_Kids_Baby = 15099
Clothing_Shoes_Jewelry = 12543
Sports_Outdoors = 10342
Automotive_Industrial = 9638
... (more rows)
Of course, I can't just cache these results, because they are for a "non input search". If the user makes a query like "Iphone", the query looks like this:
(new Neo4jVertexSequence(
g.getRawGraph().index().forNodes('products').query(
new org.neo4j.index.lucene.QueryContext('search:"iphone" AND category:?')
), g
))._().groupBy{it.category}.cap.next();
What about your domain model? Did you just put everything in the index? Usually you would model your categories as nodes and relate your products to the category nodes.
(product)-[:HAS_CATEGORY]->(category)<-[:IS_CATEGORY]-(categories)
In your query you would just traverse this little tree and count the relationships of type :HAS_CATEGORY starting from each category node.
start categories=node(x)
match (product)-[:HAS_CATEGORY]->(category)<-[:IS_CATEGORY]-(categories)
return category.name, count(*)
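What the facet query computes (a product count per category, optionally narrowed by a search term) can be sketched in plain Python. The product list and the `facet_counts` helper are hypothetical and only model the grouping; in the graph model above each product would be linked to its category node by a HAS_CATEGORY relationship.

```python
from collections import Counter

# Hypothetical products; the category names are taken from the result list
# in the question.
products = [
    {"name": "iPhone case", "category": "Electronics_Computers"},
    {"name": "iPhone charger", "category": "Electronics_Computers"},
    {"name": "iPhone for Dummies", "category": "Books"},
    {"name": "Garden hose", "category": "Home_Garden_Tools"},
]


def facet_counts(items, search=None):
    """Count products per category, optionally restricted to a search term."""
    if search is not None:
        needle = search.lower()
        items = [p for p in items if needle in p["name"].lower()]
    return Counter(p["category"] for p in items)


print(facet_counts(products))
print(facet_counts(products, search="iphone"))
```

Without a search term every category is counted; with "iphone" the Home_Garden_Tools facet disappears because no matching product belongs to it.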
