Schema for storing historical transaction data in Neo4J? - neo4j

I have around 200 entities that have invested in a company over the last 30 years. I have been tracking how much money they contributed over time. My database will be in Neo4J.
So far on my graph I have (1) 200 nodes representing the 200 entities that have invested and (2) 1 node representing the single company they invested in.
I see two options for me to represent the capital infusions:
(1) I explicitly create 1,500 nodes, one for each initial capital infusion, capital increase, etc. The nodes capture information on changes in dollar amounts, etc. Then my graph is roughly this: (e:Entity)-[:PROVIDES]->(f:Financing {amount: {value}, year: {2010}})-[:PROVIDES]->(t:Target). In some ways I find this much cleaner and easier for analysis down the road, but this will be a larger graph and the PROVIDES relationships are not particularly insightful.
(2) I represent those 1,500 financing rounds much more directly as relationships between the 200 entities and the target company:
(e:Entity)-[:FINANCING {amount: {value}, year: {2010}}]->(t:Target). In that case, I'm a bit unsure about how to handle the analysis afterwards, or whether it makes sense to have, say, 50 FINANCING relationships between Entity X and the target company.
The type of analysis I'd like to do would include (1) generating the target entity ownership say in year 2004, (2) generating the evolution over time of shareholding in the target company by entity X, etc.
What would you recommend as a solution for the schema? I know Neo4J is schema-optional but I suspect this choice between nodes and relationships matters.
Many thanks!
Cheers

For data that is going to be frequently queried but has a limited, finite number of possible values (like years, especially for only 30 years) a lot of times you'll see better performance if you move that year property onto a separate node, so that you can quickly group all of the nodes that attach to it and fetch its year property once, instead of essentially re-creating a property index for it. That necessitates adding a :Financing node in this case, so that you can hook up :Entity, :Target, and the :Year nodes all to the same transaction record.
So your data model would be like:
(:Entity) - [:PROVIDES] -> (:Financing {amount: x}) - [:PROVIDES] -> (:Target)
(:Financing) - [:OCCURRED_IN] -> (:Year {year: 1999})
thereby allowing you to slice your data by Year value without having to scan all of your nodes for properties. You could also put a property index on :Financing(year), but modelling limited, discrete properties like year as a separate path allows you to more easily extend your graph, and makes good query performance easier to achieve.
Either way, though, you will definitely want a :Financing node in the middle. Properties on relationships should rarely be used for anything except being returned in a result; in Neo4j versions before 4.3 they can't be indexed, so filtering on them always requires a property scan, and if you have a lot of relationships, that can add up fast.
Starter queries (assuming that ownership is % of total amount provided up to a given point), to get % ownership by entity at the end of 2004:
MATCH (t:Target {id: 1})
WITH t
MATCH (y:Year)
WHERE y.year <= 2004
WITH t, y
// relationship type matches the data model above (:PROVIDES)
MATCH (y) <- [:OCCURRED_IN] - (f:Financing) - [:PROVIDES] -> (t)
WITH f, f.amount as amt
WITH COLLECT({f: f, amt: amt}) AS rows, SUM(amt) AS total
UNWIND rows AS row
WITH row.f as f, row.amt as amt, total
MATCH (e:Entity) - [:PROVIDES] -> (f)
WITH e, SUM(amt) AS part, total
// multiply by 100.0 first so integer amounts are not truncated by integer division
RETURN e, 100.0 * part / total AS percentage
And to get Entity 2 (arbitrary identifier)'s proportion of financing provided each year:
MATCH (t:Target {id:1})
WITH t
MATCH (y:Year)
WITH t, y
// relationship type matches the data model above (:PROVIDES)
MATCH (y) <- [:OCCURRED_IN] - (f:Financing) - [:PROVIDES] -> (t)
WITH y, f, f.amount as amt
WITH y, COLLECT({f: f, amt: amt}) AS rows, SUM(amt) AS total_per_y
UNWIND rows AS row
WITH y, row.f as f, row.amt as amt, total_per_y
MATCH (f) <- [:PROVIDES] - (:Entity {id:2})
WITH y, total_per_y, SUM(amt) AS part_per_y
// multiply by 100.0 first so integer amounts are not truncated by integer division
RETURN y.year, 100.0 * part_per_y / total_per_y AS percentage
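To make the ownership arithmetic concrete, here is a minimal Python sketch of the same calculation on in-memory records. The sample data and the name ownership_at are made up for illustration, and it assumes (as the queries do) that ownership is each entity's share of all money provided up to and including the given year.

```python
from collections import defaultdict

# Hypothetical sample data: (entity, year, amount) financing records.
FINANCINGS = [
    ("A", 1999, 100),
    ("B", 2001, 300),
    ("A", 2004, 100),
    ("C", 2006, 500),
]


def ownership_at(financings, year):
    """Return each entity's share (in %) of all financing up to and including `year`."""
    parts = defaultdict(float)
    for entity, fin_year, amount in financings:
        if fin_year <= year:
            parts[entity] += amount
    total = sum(parts.values())
    if total == 0:
        return {}
    # Multiply by 100.0 before dividing, mirroring the Cypher RETURN clause.
    return {entity: 100.0 * part / total for entity, part in parts.items()}


ownership_2004 = ownership_at(FINANCINGS, 2004)
print(ownership_2004)
```

Entity C only invests in 2006, so it does not appear in the 2004 result; calling the function once per year gives the evolution over time.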

Related

Neo4j Cypher return combination of matched nodes

I'd like to do some analytics over the database with the following schema:
As you may see on the picture above, there is a Vacancy1 which requires knowledge of Java and Python and offers a salary of 5000$. Also, there is a set of candidates who know everything (like Candidate5) and suit the salary (Candidate5's desired salary equals 4950$), and candidates who know only some skills, like only Java or Python, but together know everything that is required for the vacancy, for example:
Candidate1(Java, 2000$), Candidate2(Python, 1500$)
Such a set of candidates together knows Java and Python, and their combined salary equals 3500$.
Is it possible to write a query in Neo4j that finds all possible sets of candidates which suit such vacancy conditions?
For example, for the picture above the result should contain, something like that:
[candidate5],
[candidate1, candidate2],
[candidate1, candidate4],
[candidate3, candidate2]
Please note, that the combinations of the candidates in the result may contain any number of candidates and not limited to only 1 or 2 as in the example above.
Could you please show an example of such Cypher query?
UPDATED
What if I need to take into account some additional properties, like for example experience, like minExp on the diagram below:
Here, we need a candidate for the Vacancy1 with minExp = 3
Candidate2 has exp (experience) = 2 and is not a good fit from the Java point of view, but paired with Candidate3 (exp = 5), together they are a good fit for Vacancy1. Is it possible to improve the query in order to take this information into account and build such combinations?
I am a fan of Neo4j APOC functions, and in APOC there is a function that gives all possible combinations of a given list. It returns a list of lists with 1, 2, 3, ... or n items.
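With ["Java", "Python"] as skills
With skills, size(skills) as n
Match (v:Vacancy)-[:CONTAINS]->(s:Skills)<-[:CONTAINS]-(c:Candidate)
// a single candidate must not ask for more than the vacancy offers
Where s.language in skills and c.salary <= v.salary
// DISTINCT: a candidate matching several skills should appear only once
With n, v, collect(DISTINCT c) as candidates
With v, apoc.coll.combinations(candidates, 1, n) as allCandidatesCombi
Unwind allCandidatesCombi as combi
// sum the salary of each candidate c in the combination
With v, combi Where apoc.coll.sum([c in combi | c.salary]) <= v.salary
Return v, combi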
Notes:
n is the number of skills, used as the maximum number of candidates in a result set
apoc.coll.combinations will give you all possible combinations of the candidates, with 1 to n candidates each
Unwind is like a for loop and gives you each item of that list one at a time
apoc.coll.sum will sum up the candidates' salaries
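The combination step can be sketched outside the database with Python's itertools.combinations, which plays the role of apoc.coll.combinations here. The sample data and the name matching_sets are made up for illustration. Unlike the query above, this sketch also checks that the combined skills cover everything the vacancy requires, which is an assumption about what the question intends.

```python
from itertools import combinations

# Hypothetical sample data loosely following the question's example.
REQUIRED = {"Java", "Python"}
MAX_SALARY = 5000
CANDIDATES = {
    "candidate1": ({"Java"}, 2000),
    "candidate2": ({"Python"}, 1500),
    "candidate5": ({"Java", "Python"}, 4950),
}


def matching_sets(candidates, required, max_salary):
    """Return candidate sets that cover all required skills within the salary budget."""
    names = sorted(candidates)
    result = []
    # Try sets of 1 .. n candidates, where n is the number of required skills.
    for size in range(1, len(required) + 1):
        for combo in combinations(names, size):
            skills = set().union(*(candidates[name][0] for name in combo))
            salary = sum(candidates[name][1] for name in combo)
            if required <= skills and salary <= max_salary:
                result.append(list(combo))
    return result


sets = matching_sets(CANDIDATES, REQUIRED, MAX_SALARY)
print(sets)
```

An extra condition such as minimum experience could be added as one more test inside the innermost if, for example by requiring that at least one member of the set meets it for each skill.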

neo4j get random path from known node

I have a big neo4j db with info about celebs, all of them have relations with many others, they are linked, dated, married to each other. So I need to get random path from one celeb with defined count of relations (5). I don't care who will be in this chain, the only condition I have I shouldn't have repeated celebs in chain.
To be more clear: I need to get "new" chain after each query, for example:
I try to get chain started with Rita Ora
She has relations with Drake, Jay Z and Justin Bieber
Query takes random from these guys, for example Jay Z
Then Query takes relations of Jay Z: Karrine Steffans, Rosario Dawson and Rita Ora
Query can't take Rita Ora cuz she is already in chain, so it takes random from others two, for example Rosario Dawson
...
And at the end we should have a chain Rita Ora - Jay Z - Rosario Dawson - other celeb - other celeb 2
Is that possible to do it by query?
This is doable in Cypher, but it's quite tricky. You mention that "the only condition I have I shouldn't have repeated celebs in chain."
This condition could be captured by using node-isomorphic pattern matching, which requires all nodes in a path to be unique. Unfortunately, this is not yet supported in Cypher. It is proposed as part of the openCypher project, but is still work-in-progress. Currently, Cypher only supports relationship uniqueness, which is not enough for this use case as there are multiple relationship types (e.g. A is married to B, but B also collaborated with A, so we already have a duplicate with only two nodes).
APOC solution. If you can use the APOC library, take a look at the path expander, which supports various uniqueness constraints, including NODE_GLOBAL.
Plain Cypher solution. To work around this limitation, you can capture the node uniqueness constraint with a filtering operation:
// [*4] = 4 relationships, i.e. a chain of 5 celebrities as in the example
MATCH p = (c1:Celebrity {name: 'Rita Ora'})-[*4]-(c2:Celebrity)
UNWIND nodes(p) AS node
WITH p, count(DISTINCT node) AS countNodes
WHERE countNodes = 5
RETURN p
LIMIT 1
Performance-wise this should be okay as long as you limit its results because the query engine will basically keep enumerating new paths until one of them passes the filtering test.
The goal of the UNWIND nodes(p) AS node WITH count(DISTINCT node) ... construct is to remove duplicates from the list of nodes by first UNWIND-ing it to separate rows, then aggregating them to a unique collection using DISTINCT. We then check whether the list of unique nodes still has 5 elements - if so, the original list was also unique and we RETURN the results.
Note. Instead of UNWIND and count(DISTINCT ...), getting unique elements from a list could be expressed in other ways:
(1) Using a list comprehension and ranges:
WITH [1, 2, 2, 3, 2] AS l
RETURN [i IN range(0, size(l)-1) WHERE NOT l[i] IN l[0..i] | l[i]]
(2) Using reduce:
WITH [1, 2, 2, 3, 2] AS l
RETURN reduce(acc = [], i IN l | acc + CASE NOT i IN acc WHEN true THEN [i] ELSE [] END)
However, I believe both forms are less readable than the original one.
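For comparison, the step-by-step procedure described in the question (pick a random relation, skip anyone already in the chain) can be sketched in plain Python as a randomized backtracking walk. This is only an illustration of the idea, not how Cypher or APOC evaluate it. The first five edges come from the question; the "Other Celeb" nodes and their edges are made up, as is the name random_chain.

```python
import random

# Undirected relations; the "Other Celeb" entries are hypothetical filler.
EDGES = [
    ("Rita Ora", "Drake"),
    ("Rita Ora", "Jay Z"),
    ("Rita Ora", "Justin Bieber"),
    ("Jay Z", "Karrine Steffans"),
    ("Jay Z", "Rosario Dawson"),
    ("Rosario Dawson", "Other Celeb 1"),
    ("Karrine Steffans", "Other Celeb 1"),
    ("Other Celeb 1", "Other Celeb 2"),
]

GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, []).append(b)
    GRAPH.setdefault(b, []).append(a)


def random_chain(graph, start, length, rng):
    """Return a random path of `length` distinct nodes from `start`, or None."""
    def extend(path):
        if len(path) == length:
            return path
        # Node uniqueness: never revisit someone already in the chain.
        neighbours = [n for n in graph.get(path[-1], []) if n not in path]
        rng.shuffle(neighbours)
        for nxt in neighbours:
            found = extend(path + [nxt])
            if found:
                return found
        return None  # dead end, let the caller try another neighbour

    return extend([start])


chain = random_chain(GRAPH, "Rita Ora", 5, random.Random())
print(" - ".join(chain))
```

Because the neighbours are shuffled at every step, repeated calls can return different chains, and backtracking means a dead end (such as Drake in this sample) does not make the call fail.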

neo4j - restrict query based on node's rank

I have a hierarchical structure of nodes, which all have a custom-assigned sorting property (numeric). Here's a simple Cypher query to recreate:
merge (p {my_id: 1})-[:HAS_CHILD]->(c1 { my_id: 11, sort: 100})
merge (p)-[:HAS_CHILD]->(c2 { my_id: 12, sort: 200 })
merge (p)-[:HAS_CHILD]->(c3 { my_id: 13, sort: 300 })
merge (c1)-[:HAS_CHILD]->(cc1 { my_id: 111 })
merge (c2)-[:HAS_CHILD]->(cc2 { my_id: 121 })
merge (c3)-[:HAS_CHILD]->(cc3 { my_id: 131 });
The problem I'm struggling with is that often I need to make decisions based on a child node's rank relative to some parent node, with regard to this sort identifier. So, for example, node c1 has rank 1 relative to node p (because it has the lowest sort value), c2 has rank 2, and c3 has rank 3 (the highest sort value).
The kind of decision I need to make based on this information: display the children of only the first 2 cX nodes. Here's what I want to get:
cc1 and cc2 are present, but cc3 is not because c3 (its parent) is not the first or the second child of p. Here's a dumb query for that:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc) where c.sort <= 200
return p, c, cc
The problem is, these sort properties are custom-set and imported, so I have no way of knowing which value will be held for child number 2.
My current solution is to rank it during import, and since I'm using Oracle, that's quite simple -- I just need to use the rank window function. But it seems awkward to me and I feel like there could be a more elegant solution to that. I tried the next query and it works, but it looks weird and it's quite slow on bigger graphs:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc)
where size([ (p)-->(c1) where c1.sort < c.sort |c1]) < 2
return p, c, cc
Here's the plan for this query and the most expensive part is in fact the size expression:
The slowness you're seeing is likely because you're not performing an index lookup in your query, so it's performing an all nodes scan and accessing the my_id property of every node in your graph to find the one with id 1 (your p node).
You need to add labels on your nodes and use these labels in your queries (at least for your p node), and create an index (or in this case, probably a unique constraint) on the label for my_id so this lookup becomes fast.
You can confirm what's going on by doing a PROFILE of your query (if you can add the profile plan to your description, with all elements of the plan expanded that would help determine further optimizations).
As for your query, something like this should work (I'm using a :Node label as a standin for your actual label)
match (p:Node {my_id: 1 })-->(c)
with p, c
order by c.sort asc
with p, collect(c) as children // children are in order
unwind children[..2] as child // one row for each of the first 2 children
optional match (child)-->(cc) // only matched for the first 2 children
return p, children, collect(cc) as grandchildren
Note that this only returns nodes, not paths or relationships. The reason why you're getting the result graph in the graphical view is because, in the Browser Setting tab (the gear icon in the lower left menu) you have Connect result nodes checked at the bottom.
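The order, collect, slice, then expand pattern used in that query can be sketched in Python on an in-memory copy of the sample tree. The dictionaries and the name grandchildren_of_top are made up for illustration.

```python
# Hypothetical in-memory copy of the question's sample tree.
SORT = {"c1": 100, "c2": 200, "c3": 300}
CHILDREN = {
    "p": ["c3", "c1", "c2"],  # deliberately not in sort order
    "c1": ["cc1"],
    "c2": ["cc2"],
    "c3": ["cc3"],
}


def grandchildren_of_top(parent, top_n=2):
    """Rank children by their sort value, then expand only the first `top_n`."""
    # ORDER BY c.sort + collect(c): children ranked by sort value
    ranked = sorted(CHILDREN.get(parent, []), key=lambda child: SORT[child])
    # children[..2] + OPTIONAL MATCH: expand only the top-ranked children
    grandchildren = [
        grandchild
        for child in ranked[:top_n]
        for grandchild in CHILDREN.get(child, [])
    ]
    return ranked, grandchildren


children, grandchildren = grandchildren_of_top("p")
print(children, grandchildren)
```

The rank is never stored anywhere: it falls out of the sorted order, which is why the actual sort values do not need to be known in advance.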

Optimizing a Cypher query to improve performance

The query I've written returns accurate results based on some random testing I've done. However, the query execution takes really long (7699.43 s)
I need help optimising this query.
count(Person) -> 67895
count(has_POA) -> 355479
count(POADocument) -> 40
count(issued_by) -> 40
count(Company) -> 21
count(PostCode) -> 9845
count(Town) -> 1673
count(in_town) -> 9845
count(offers_services_in) -> 17107
All the entity nodes are indexed on Id's (not Neo4j IDs). The PostCode nodes are also indexed on PostCode.
MATCH pa = (p:Person)-[r:has_POA]->(d:POADocument)-[:issued_by]->(c:Company),
(pc:PostCode), (t:Town)
WHERE r.recipient_postcode = pc.PostCode
AND (pc)-[:in_town]->(t)
AND NOT (c)-[:offers_services_in]->(t)
RETURN p as Person, r as hasPOA, t as Town, d as POA, c as Company
Much thanks in advance!
-Nancy
I made some changes in your query:
MATCH (p:Person)-[r:has_POA {recipient_postcode : $code }]->(d:POADocument)-[:issued_by]->(c:Company),
(pc:PostCode {PostCode : $PostCode })-[:in_town]->(t:Town)
WHERE NOT (c)-[:offers_services_in]->(t)
RETURN p as Person, r as hasPOA, t as Town, d as POA, c as Company
Since you are not using the entire path, removed pa variable
Moved the pattern existence check ((pc)-[:in_town]->(t)) from WHERE to MATCH.
Using parameters instead of the equality check r.recipient_postcode = pc.PostCode in where. If you are running the query in Neo4j Browser, you can set the parameters running the command :params {code : 10}.
Here is a simplified version of your current query.
MATCH (p:Person)-[r:has_POA]->(d:POADocument)-[:issued_by]->(c:Company)
MATCH (t:Town)<-[:in_town]-(pc:PostCode{PostCode:r.recipient_postcode})
WHERE NOT (c)-[:offers_services_in]->(t)
RETURN p as Person,r as hasPOA,t as Town, d as POA,c as Company
Your big performance hits are going to be on the Cartesian product between all the match sets, and the raw amount of data you are asking for.
In this simplified version, I'm using one less match, and the second match uses a variable from the first match to avoid generating a Cartesian product. I would also recommend using LIMIT and SKIP to page your results to limit data transfer.
If you can adjust your model, I would recommend converting the has_POA relation to an issued_POA node so that you can take advantage of Neo4j's relation finding on the 2 postcodes related to that instance, and making the second match a gimme instead of an extra indexed search (after you adjust the query to match the new model, of course).
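To illustrate why the Cartesian product hurts, here is a small Python sketch that joins the same rows in two ways: by comparing every pair, and by looking each postcode up directly (the rough analogue of using a value from the first match against the PostCode index). The data and function names are made up for illustration. With the counts from the question, pairing every has_POA relationship with every PostCode node would mean roughly 355,479 x 9,845, or about 3.5 billion, combinations if the full product were actually built.

```python
# Hypothetical miniature data set.
POA_ROWS = [("p1", "AB1"), ("p2", "CD2"), ("p3", "AB1")]  # (person, recipient_postcode)
POSTCODE_TOWN = {"AB1": "Town1", "CD2": "Town2", "EF3": "Town3"}


def join_cartesian(rows, postcodes):
    """Compare every row with every postcode, then keep the matching pairs."""
    comparisons = 0
    matched = []
    for person, code in rows:
        for postcode, town in postcodes.items():
            comparisons += 1
            if postcode == code:
                matched.append((person, town))
    return matched, comparisons


def join_lookup(rows, postcodes):
    """Look each row's postcode up directly, as an index lookup would."""
    lookups = 0
    matched = []
    for person, code in rows:
        lookups += 1
        town = postcodes.get(code)
        if town is not None:
            matched.append((person, town))
    return matched, lookups


slow_result, slow_work = join_cartesian(POA_ROWS, POSTCODE_TOWN)
fast_result, fast_work = join_lookup(POA_ROWS, POSTCODE_TOWN)
print(slow_work, fast_work)
```

Both functions return the same pairs; only the amount of work differs, and the gap grows with the product of the two input sizes.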

neo4j cypher: "stacking" nodes from query result

Considering the existence of three types of nodes in a db, connected by the schema
(a)-[ra {qty}]->(b)-[rb {qty}]->(c)
with the user being able to have some of each in their wishlist or whatever.
What would be the best way to query the database to return a list of all the nodes the user has on their wishlist, considering that when he has an (a) then in the result the associated (b) and (c) should also be returned after having multiplied some of their fields (say b.price and c.price) for the respective ra.qty and rb.qty?
NOTE: you can find the same problem without the variable length over here
Assuming you have users connected to the things they want like so:
(user:User)-[:WANTS]->(part:Part)
And that parts, like you describe, have dependencies on other parts in specific quantities:
CREATE
(a:Part) -[:CONTAINS {qty:2}]->(b:Part),
(a) -[:CONTAINS {qty:3}]->(c:Part),
(b) -[:CONTAINS {qty:2}]->(c)
Then you can find all parts, and how many of each, you need like so:
MATCH
(user:User {name:"Steven"})-[:WANTS]->(part),
chain=(part)-[:CONTAINS*1..4]->(subcomponent:Part)
RETURN subcomponent, sum( reduce( total=1, r IN relationships(chain) | total * r.qty) ) AS quantity
The 1..4 term says to look between 1-4 sub-components down the tree. You can obv. set that to whatever you like, including "1..", infinite depth.
The second term there is a bit complex. It helps to try the query without the sum to see what it does. Without that, the reduce will do the multiplying of parts that you want for each "chain" of dependencies. Adding the sum will then aggregate the result by subcomponent (inferred from your RETURN clause) and sum up the total count for that subcomponent.
Figuring out the price is then a matter of multiplying each part's price by its aggregate quantity. I'll leave that as an exercise for the reader ;)
You can try this out by running the queries in the online console at http://console.neo4j.org/
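The reduce-then-sum logic can also be checked by hand with a short Python sketch over the same three-part example (a contains 2 b and 3 c, b contains 2 c). The dictionary layout and the names chains and totals are made up for illustration; max_depth stands in for the upper bound of the variable-length pattern.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical in-memory copy of the CREATE example: part -> [(subpart, qty)].
CONTAINS = {"a": [("b", 2), ("c", 3)], "b": [("c", 2)]}


def chains(part, max_depth=4, qtys=()):
    """Yield (subcomponent, quantities along the chain) for each chain below `part`."""
    if len(qtys) >= max_depth:
        return
    for child, qty in CONTAINS.get(part, []):
        path = qtys + (qty,)
        yield child, path
        yield from chains(child, max_depth, path)


def totals(wanted):
    """Sum, per subcomponent, the product of quantities along every chain."""
    result = defaultdict(int)
    for part in wanted:
        for subcomponent, qtys in chains(part):
            # reduce(total = 1, r IN relationships(chain) | total * r.qty)
            result[subcomponent] += reduce(lambda total, q: total * q, qtys, 1)
    return dict(result)


needed = totals(["a"])
print(needed)
```

Part c is reached twice, directly (3) and through b (2 x 2 = 4), which is why the per-chain products have to be summed per subcomponent.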

Resources