Clustering Series of Events - machine-learning

I have a time-series of events which I need to cluster into groups. These events do not always come in the same amounts: at any given time there can be 1 event, and at other times there can be n events.
Just for reference, the dataset looks like:
Time | Events
t1 | [A, B, C]
t2 | [B, E, F]
t3 | [B, E, G, H, K]
t4 | [A, B, C, D]
Question: I am trying to see how many clusters of such events exist based on some sort of similarity. How can I solve this problem both when we care about the sequence of events at time t and when we do not care about the order in which the events occur at time t (A, B, C is similar to B, C, A)?

So, affinity propagation will automatically figure out the optimal number of clusters for you (or at least it is supposed to do that). Check out the link below and try to adapt that to your specific needs.
https://scikit-learn.org/stable/auto_examples/cluster/plot_affinity_propagation.html#sphx-glr-auto-examples-cluster-plot-affinity-propagation-py
In short, vectorize your specific data and feed that into the algorithm. Post back if you get stuck doing that.
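For the order-insensitive case, here is a minimal sketch of that vectorization step (not from the original answer; the toy data below simply mirrors the example table): each timestamp becomes a multi-hot "bag of events" vector, which can then be fed to scikit-learn's AffinityPropagation.
# Sketch only: multi-hot encoding of the event sets, then affinity propagation.
from sklearn.cluster import AffinityPropagation
from sklearn.preprocessing import MultiLabelBinarizer

events = [
    ["A", "B", "C"],            # t1
    ["B", "E", "F"],            # t2
    ["B", "E", "G", "H", "K"],  # t3
    ["A", "B", "C", "D"],       # t4
]

# Order-insensitive encoding: each row becomes a 0/1 vector over the event
# vocabulary, so [A, B, C] and [B, C, A] map to the same vector.
X = MultiLabelBinarizer().fit_transform(events)

ap = AffinityPropagation(random_state=0).fit(X)
print(ap.labels_)                        # cluster id per timestamp
print(len(ap.cluster_centers_indices_))  # number of clusters found
If the order of events at time t matters, a bag-of-events encoding is not enough; one option is to compute a pairwise similarity matrix yourself (e.g. from an edit distance between the event sequences) and fit AffinityPropagation(affinity='precomputed') on that matrix instead.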

Related

Cypher Query to Collect Arbitrary Depth Nodes and Edge Properties

I have a graph that looks like the image below. However, the depth and the number of rollups from the Person to the topmost Rollup are variable, depending on how the rollups have been structured by the user. The edges from the Person to the Metric (HAS_METRIC) have the score values, and the relationships from the metrics to the Rollup (HAS_PARENT) have the weighting that should be applied to the value as it is rolled up to a top score.
Ideally, I would like to have a query that produces a table with the rollup and the summed/weighted scores. Like this:
node       | value
-----------|------
Metric A   | 23
Metric B   | 55
Metric C   | 29
Metric D   | 78
Rollup A   | 45.4
Rollup B   | 58.4
Rollup Tot | 51.9
However, I do not understand how to collect the edge properties for the HAS_PARENT relationships.
MATCH (p:Person)-[score:HAS_METRIC]->(m:Metric)-[weight:HAS_PARENT]->(ru:Rollup)
-[par_rel:HAS_PARENT*..8]->(ru_par:Rollup)
WITH p, score, m, weight, par_rel, ru, ru_par
RETURN p.uid, score.score, m.uid, weight.weight, ru.uid, par_rel.weight, ru_par.uid
This query is giving me a type mismatch because it does not know what to do with the par_rel.weight. Any pointers are appreciated.
I believe what you are searching for is the relationships(path) function. It is one of the default path functions in Cypher. It returns all relationships in a given path, and you can combine it with one or more Cypher list expressions to get the values you need from the relationships.
Generally speaking, you could do something like:
MATCH p = (n)-[:HAS_PARENT*..8]->()
RETURN [x IN relationships(p) | x.weight] AS weights
You might also find the reduce function useful. E.g.:
...
RETURN reduce(s = 0, x IN relationships(p) | s + x.weight) AS sumWeight
But you need to be careful with your variable length path queries and probably constrain them in order to get only the paths you are interested in.
It would probably be good advice to mark your leaf and root nodes so that you match only paths from a leaf to a/the root, not just intermediate ones. E.g.:
MATCH p = (n)-[:HAS_PARENT*..8]->(root)
WHERE NOT (root)-[:HAS_PARENT]->() AND NOT (n)<-[:HAS_PARENT]-()
...
And of course you can combine these Cypher fragments with others in order to return everything you need in one single query.
I hope this helps. Let us know when you succeed.

neo4j get random path from known node

I have a big neo4j db with info about celebs. All of them have relations with many others: they are linked, dated, or married to each other. So I need to get a random path from one celeb with a defined count of relations (5). I don't care who will be in this chain; the only condition I have is that there shouldn't be repeated celebs in the chain.
To be more clear, I need to get a "new" chain after each query. For example:
I try to get a chain starting with Rita Ora.
She has relations with Drake, Jay Z and Justin Bieber.
The query takes a random one of these guys, for example Jay Z.
Then the query takes the relations of Jay Z: Karrine Steffans, Rosario Dawson and Rita Ora.
The query can't take Rita Ora because she is already in the chain, so it takes a random one of the other two, for example Rosario Dawson.
...
And at the end we should have a chain Rita Ora - Jay Z - Rosario Dawson - other celeb - other celeb 2
Is it possible to do this with a query?
This is doable in Cypher, but it's quite tricky. You mention that
the only condition I have is that there shouldn't be repeated celebs in the chain.
This condition could be captured by using node-isomorphic pattern matching, which requires all nodes in a path to be unique. Unfortunately, this is not yet supported in Cypher. It is proposed as part of the openCypher project, but is still work-in-progress. Currently, Cypher only supports relationship uniqueness, which is not enough for this use case as there are multiple relationship types (e.g. A is married to B, but B also collaborated with A, so we already have a duplicate with only two nodes).
APOC solution. If you can use the APOC library, take a look at the path expander, which supports various uniqueness constraints, including NODE_GLOBAL.
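As a rough sketch (assuming the APOC path expander procedures are installed; exact configuration keys can differ slightly between APOC versions), the call could look like this:
// Sketch only: apoc.path.expandConfig with NODE_GLOBAL uniqueness,
// expanding exactly 5 hops from the start node.
MATCH (c1:Celebrity {name: 'Rita Ora'})
CALL apoc.path.expandConfig(c1, {minLevel: 5, maxLevel: 5, uniqueness: 'NODE_GLOBAL'})
YIELD path
RETURN path
LIMIT 1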
Plain Cypher solution. To work around this limitation, you can capture the node uniqueness constraint with a filtering operation:
MATCH p = (c1:Celebrity {name: 'Rita Ora'})-[*5]-(c2:Celebrity)
UNWIND nodes(p) AS node
WITH p, count(DISTINCT node) AS countNodes
WHERE countNodes = 6
RETURN p
LIMIT 1
Performance-wise this should be okay as long as you limit its results because the query engine will basically keep enumerating new paths until one of them passes the filtering test.
The goal of the UNWIND nodes(p) AS node WITH count(DISTINCT node) ... construct is to remove duplicates from the list of nodes by first UNWIND-ing it to separate rows, then aggregating them to a unique collection using DISTINCT. We then check whether the list of unique nodes still has 6 elements (a path with 5 relationships visits 6 nodes) - if so, the original list was also free of duplicates and we RETURN the path.
Note. Instead of UNWIND and count(DISTINCT ...), getting unique elements from a list could be expressed in other ways:
(1) Using a list comprehension and ranges:
WITH [1, 2, 2, 3, 2] AS l
RETURN [i IN range(0, length(l)-1) WHERE NOT l[i] IN l[0..i] | l[i]]
(2) Using reduce:
WITH [1, 2, 2, 3, 2] AS l
RETURN reduce(acc = [], i IN l | acc + CASE NOT i IN acc WHEN true THEN [i] ELSE [] END)
However, I believe both forms are less readable than the original one.

Filtering out nodes on two cypher paths

I have a simplified Neo4j graph (old version 2.x), as in the image below, with 'define' and 'same' edges. Assume the number on the define edge is a property on the edge.
The queries I would like to run are:
1) Find nodes defined by both A and B -- Required result: C, C, D
START A=node(885), B=node(996) MATCH (A-[:define]->(x)<-[:define]-B) RETURN DISTINCT x
The above works and returns C and D, but I want C twice since it is defined twice. However, without the DISTINCT on x, it returns all the paths from A to B.
2) Find nodes that are NOT (defined by both A and B, OR defined by both A and B but connected via a 'same' edge) -- Required result: G
Something like:
R1: MATCH (A-[:define]->(x)<-[:define]-B) RETURN DISTINCT x
R2: MATCH (A-[:define]->(e)-(:similar)-(f)<-[:define]-B) RETURN e,f
(Nodes defined by A - (R1+R2) )
3) Find 'middle' nodes that do not have matching calls from both A and B -- Required result: C, G
I want to output C due to the one define (either 45 or 46) that does not have a matching define from B.
Also output G because there's no define to G from B.
Appreciate any help on this!
Your syntax is a bit strange to me, so I'm going to assume you're using an older version of Neo4j. We should be able to use the same approaches, though.
For #1, your proposed match without DISTINCT really should be working. The only thing I can see is adding the missing parentheses around the A and B node variables.
START A=node(885), B=node(996)
MATCH (A)-[:define]->(x)<-[:define]-(B)
RETURN x
Also, I'm not sure what you mean by "returns all paths from A to B." Can you clarify that, and provide an example of the output?
As for #2, we'll need several parts to this query, separating them with WITH accordingly.
START A=node(885), B=node(996)
MATCH (A)-[:define]->(x)<-[:define]-(B)
WITH A, B, COLLECT(DISTINCT x) as exceptions
OPTIONAL MATCH (A)-[:define]->(x)-[:same]-(y)<-[:define]-(B)
WHERE x NOT IN exceptions AND y NOT IN exceptions
WITH A, B, exceptions + COLLECT(DISTINCT x) + COLLECT(DISTINCT y) as allExceptions
MATCH (aNode)
WHERE aNode NOT IN allExceptions AND aNode <> A AND aNode <> B
RETURN aNode
Also, you should really be using labels on your nodes. The final match will match all nodes in your graph and will have to filter down otherwise.
EDIT
Regarding your #3 requirement, the SIZE() function will be very helpful here, as you can get the size of a pattern match, and it will tell you the number of occurrences of that pattern.
The approach on this query is to first get the collection of nodes defined by A or B, then filter down to the nodes where the number of :defines relationships from A are not equal to the number of :defines relationships from B.
While we would like to use something like a UNION to combine the nodes defined by A with the nodes defined by B, Neo4j's UNION support is weak right now: it doesn't let you do any additional operations after the UNION happens, so instead we have to resort to adding both sets of nodes into the same collection and then unwinding them back into rows.
START A=node(885), B=node(996)
MATCH (A)-[:define]->(x)
WITH A, B, COLLECT(x) as middleNodes
MATCH (B)-[:define]->(x)
WITH A, B, middleNodes + COLLECT(x) as allMiddles
UNWIND allMiddles as middle
WITH DISTINCT A, B, middle
WHERE SIZE((A)-[:define]->(middle)) <> SIZE((B)-[:define]->(middle))
RETURN middle

Schema for storing historical transaction data in Neo4J?

I have around 200 entities that have invested in a company over the last 30 years. I have been tracking how much money they contributed over time. My database will be in Neo4J.
So far on my graph I have (1) 200 nodes representing the 200 entities that have invested and (2) 1 node representing the single company they invested in.
I see two options for me to represent the capital infusions:
I explicitly create 1,500 nodes representing each initial capital infusion, capital increase, etc. The nodes capture information on changes in dollar amounts, etc. Then my graph is roughly this: (e:Entity)-[:PROVIDES]->(f:Financing {amount: {value}, year: {2010}})-[:PROVIDES]->(t:Target). In some ways, I find this much cleaner and easier for analysis down the road, but this will be a larger graph and the PROVIDES relationships are not particularly insightful.
I represent those 1,500 financing rounds much more directly as relationships between the 200 entities and the target company
(e:Entity)-[:FINANCING {amount: {value}, year: {2010}}]->(t:Target). In that case, I'm a bit unsure about how to handle the analysis afterwards or whether it makes sense to have say 50 FINANCING relationships between Entity X and the target company.
The type of analysis I'd like to do would include (1) generating the target entity ownership say in year 2004, (2) generating the evolution over time of shareholding in the target company by entity X, etc.
What would you recommend as a solution for the schema? I know Neo4J is schema-optional but I suspect this choice between nodes and relationships matters.
Many thanks!
Cheers
For data that is going to be frequently queried but has a limited, finite number of possible values (like years, especially for only 30 years), you'll often see better performance if you move that year property onto a separate node, so that you can quickly group all of the nodes that attach to it and fetch its year property once, instead of essentially re-creating a property index for it. That necessitates adding a :Financing node in this case, so that you can hook up the :Entity, :Target, and :Year nodes all to the same transaction record.
So your data model would be like:
(:Entity) - [:PROVIDES] -> (:Financing {amount: x}) - [:PROVIDES] -> (:Target)
(:Financing) - [:OCCURRED_IN] -> (:Year {year: 1999})
thereby allowing you to slice your data by Year value without having to scan all of your nodes for properties. You could also put a property index on :Financing(year), but modelling limited, discrete properties like year as a separate path allows you to more easily extend your graph, and makes good query performance easier to achieve.
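For reference, the property-index alternative mentioned above would just be the following (using the CREATE INDEX ON syntax of the Neo4j 2.x/3.x era this discussion assumes):
// Sketch only: a schema index on the year property of :Financing nodes.
CREATE INDEX ON :Financing(year)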
Either way, though, you will definitely want a :Financing node in the middle. Properties on relationships should rarely be used for anything except being returned in a result; they can't be indexed, so they are always going to require a property scan to get a result, and if you have a lot of relationships, that can add up fast.
Starter queries (assuming that ownership is % of total amount provided up to a given point), to get % ownership by entity at the end of 2004:
MATCH (t:Target {id: 1})
WITH t
MATCH (y:Year)
WHERE y.year <= 2004
WITH t, y
MATCH (y) <- [:OCCURRED_IN] - (f:Financing) - [:PROVIDES] -> (t)
WITH f, f.amount as amt
WITH COLLECT({f: f, amt: amt}) AS rows, SUM(amt) AS total
UNWIND rows AS row
WITH row.f as f, row.amt as amt, total
MATCH (e:Entity) - [:PROVIDES] -> (f)
WITH e, SUM(amt) AS part, total
RETURN e, part/total * 100 AS percentage
And to get Entity 2 (arbitrary identifier)'s proportion of financing provided each year:
MATCH (t:Target {id:1})
WITH t
MATCH (y:Year)
WITH t, y
MATCH (y) <- [:OCCURRED_IN] - (f:Financing) - [:PROVIDES] -> (t)
WITH y, f, f.amount as amt
WITH y, COLLECT({f: f, amt: amt}) AS rows, SUM(amt) AS total_per_y
UNWIND rows AS row
WITH y, row.f as f, row.amt as amt, total_per_y
MATCH (f) <- [:PROVIDES] - (:Entity {id:2})
WITH y, total_per_y, SUM(amt) AS part_per_y
RETURN y.year, part_per_y/total_per_y*100 AS percentage

neo4j cypher: stacking results with UNION and WITH

I'm doing a query like
MATCH (a)
WHERE id(a) = {id}
WITH a
MATCH (a)-->(x:x)-->(b:b)
WITH a, x, b
MATCH (a)-->(y:y)-->(b:b)
WITH a, x, y, b
MATCH (b)-->(c:c)
RETURN collect(a), collect(x), collect(y), collect(b), collect(c)
What I want here is for the b from MATCH (a)-->(y:y)-->(b:b) to be composed of the ones from that line and the ones from the previous MATCH (a)-->(x:x)-->(b:b). The problem I'm having with UNION is that it's picky about the number and kind of nodes to be passed on to the next query, and I'm having trouble understanding how to make it all go together.
What other solution could I use to merge these nodes during the query, or just before returning them? (Or, if I should do it with UNION, then how do I do it that way?)
(Of course the query up there could be done in other better ways. My real one can't. That is just meant to give a visual example of what I'm looking to do.)
Much obliged!
This simplified query might suit your needs.
I took out all the collect() function calls, as it is not clear that you really need to aggregate anything. For example, there will only be a single 'a' node, so aggregating the 'a's does not make sense.
Please be aware that every row of the result will be for a node labelled either 'x' or 'y'. But since every row has to have both the x and y values, every row will have a null value for one of them.
START a=node({id})
MATCH (a)-->(x:x)-->(b:b)-->(c:c)
RETURN a, x, null AS y, b, c
UNION
START a=node({id})
MATCH (a)-->(y:y)-->(b:b)-->(c:c)
RETURN a, null AS x, y, b, c
The best solution I could come up with in the end was something like this:
MATCH (a)-->(x:x)-->(b1:b)-->(c1:c)
WHERE id(a) = {id} AND NOT (a)-->(:y)-->(b1)
WITH a, collect(x) as xs, collect(DISTINCT b1) as b1s, collect(c1) as c1s
MATCH (a)-->(y:y)-->(b2:b)-->(c2:c)
RETURN a, xs, collect(y), (b1s + collect(b2)), c1s + collect(c2)
