How to cluster rows in dataset with two geospatial attributes


The title might be misleading, as I wasn't sure how to summarize the problem properly.
I have a dataset of trips, each with two locations (source and destination) plus other attributes (about the customer, cargo, equipment, etc.).
Are there any algorithms I could apply to cluster those trips using both spatial points (source and destination), not just one?
Say I have the following trips:
A1 -> B1
A2 -> B2
A1 -> C1
A2 -> C2
I want to get clusters like:
A -> B
A -> C

A very simple solution I can think of is to cluster each location independently, and then group by the pair of cluster ids.
Something like this (tested in Google BigQuery):
with data as (
  select st_geogpoint(100, 50) a, st_geogpoint(101, 51) b
  union all
  select st_geogpoint(100.01, 50) a, st_geogpoint(101.01, 51) b
  union all
  select st_geogpoint(100, 50.01) a, st_geogpoint(90, 51) b
  union all
  select st_geogpoint(100.01, 50.01) a, st_geogpoint(90, 51.01) b
),
clusters as (
  select
    a, b,
    -- cluster sources and destinations independently
    -- (epsilon of 1e4 meters, minimum cluster size of 1)
    st_clusterdbscan(a, 1e4, 1) OVER() a_id,
    st_clusterdbscan(b, 1e4, 1) OVER() b_id
  from data
)
-- one output row per (source cluster, destination cluster) pair
select
  a_id, b_id,
  st_centroid_agg(a) a_center,
  st_centroid_agg(b) b_center
from clusters
group by a_id, b_id
Result:
a_id  b_id  a_center                         b_center
0     0     POINT(100.005 50.0000001074259)  POINT(101.005 51.0000001066994)
0     1     POINT(100.005 50.0100001074192)  POINT(90 51.005)
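
The same two-step idea (cluster each endpoint on its own, then group trips by the pair of cluster ids) can be sketched outside BigQuery. The following is a minimal pure-Python sketch, not a reimplementation of ST_CLUSTERDBSCAN: it relies on the fact that DBSCAN with a minimum cluster size of 1 makes every point a core point, so clusters are simply the connected components of the "within eps" graph. The helper names haversine_m and cluster_ids are hypothetical, and the pairwise loop is quadratic, so it only suits small inputs.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation


def haversine_m(p, q):
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def cluster_ids(points, eps_m):
    """Label points by connected components of the 'within eps_m' graph."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_m(points[i], points[j]) <= eps_m:
                parent[find(j)] = find(i)

    # Renumber component roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]


# The four trips from the query above, as ((lon, lat), (lon, lat)) pairs.
trips = [
    ((100.00, 50.00), (101.00, 51.00)),
    ((100.01, 50.00), (101.01, 51.00)),
    ((100.00, 50.01), (90.00, 51.00)),
    ((100.01, 50.01), (90.00, 51.01)),
]

a_ids = cluster_ids([a for a, _ in trips], eps_m=1e4)
b_ids = cluster_ids([b for _, b in trips], eps_m=1e4)

# Group trip indices by the (source cluster, destination cluster) pair.
groups = defaultdict(list)
for trip_index, key in enumerate(zip(a_ids, b_ids)):
    groups[key].append(trip_index)

for key, members in sorted(groups.items()):
    print(key, members)
```

On this sample data the grouping matches the two rows returned by the query: all four sources fall into one cluster, and the destinations split into two.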

Related

Cypher count instances greater than

I am writing a query to display a graph comprising all the journals and their publication places (cities). I would like to filter the query to keep only the cities that are the publication place of more than 3 journals.
My attempt does give me the cities and the count, but I cannot manage to get journal.name and the relationship in the result:
MATCH (j:journal)-[p:publication_city]->(c:City)
WITH c, count(c) as cnt
WHERE cnt > 3
RETURN c, cnt
ORDER BY cnt
Any change that adds the journal variable to the query above (e.g. WITH c, count(c) as cnt, j) leads to an empty result.
Does anyone know what I am doing wrong?

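Adding j to the WITH makes it part of the grouping key, so the count is taken per (city, journal) pair and will typically be 1, which the cnt > 3 filter then removes. Instead, use the collect() function to gather the journals of each city while counting, keep the cities with more than 3 journals, and then UNWIND the collection to list the journals one per row. UNWIND works like a for-each loop over the list.
MATCH (j:journal)-[:publication_city]-(c:City)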
WITH c, count(c) as cnt, collect(j) as journals WHERE cnt > 3
UNWIND journals as journal
RETURN journal, c, cnt
ORDER BY cnt
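
To make the group, filter, and unwind steps concrete, here is a rough Python analogue of what the query does. It is only an illustration of the semantics, not how Neo4j executes it, and the journal and city names are made-up sample data.

```python
from collections import defaultdict

# Hypothetical (journal, city) pairs standing in for the rows matched by
# (j:journal)-[:publication_city]-(c:City).
rows = [
    ("J1", "Paris"), ("J2", "Paris"), ("J3", "Paris"), ("J4", "Paris"),
    ("J5", "Rome"), ("J6", "Rome"),
]

# WITH c, count(c) AS cnt, collect(j) AS journals
journals_by_city = defaultdict(list)
for journal, city in rows:
    journals_by_city[city].append(journal)

# WHERE cnt > 3, then UNWIND journals AS journal
result = [
    (journal, city, len(journals))
    for city, journals in journals_by_city.items()
    if len(journals) > 3
    for journal in journals
]

for row in result:
    print(row)
```

Only the city with four journals survives the filter, and each of its journals comes back as its own row alongside the city and the count.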

Store a collection as a node in Neo4j

I have a query that returns two variables, A and B. A holds collections with a variable number of values, and B is a numeric score for each collection. I want to store each A as a node in the database with B as a property. Then, if possible, I want to check each A node against the others and create a relationship when both B values exceed some threshold. Is this possible in Neo4j?
An example of what my A and B look like:
MATCH (t:Trans)-[:CONTAINS]->(i2:Item), (t:Trans)-[:CONTAINS]->(i1:Item), (t:Trans)-[:CONTAINS]->(i3:Item)
WITH i1, i2, i3, COUNT(*) as c
WHERE c>100
WITH COLLECT({i1: i1, i2:i2, i3:i3, c: c}) AS data
UNWIND data AS d
WITH COLLECT({i1:d.i1.I_ID, i2:d.i2.I_ID, i3:d.i3.I_ID}) as Itemset, d
RETURN Itemset, d.c as NumTransactions
A        B
{a,b,c}  45
{e,f}    23
{a,e,f}  12
{d}      89

[UPDATED]
Your query is doing a lot of unnecessary work. Also, your Itemset is always a list containing a single map, so having a list seems unnecessary.
This query should be equivalent (except that Itemset is just a map):
MATCH (t:Trans)-[:CONTAINS]->(i2:Item), (t)-[:CONTAINS]->(i1:Item), (t)-[:CONTAINS]->(i3:Item)
WITH {i1: i1.I_ID, i2: i2.I_ID, i3: i3.I_ID} AS Itemset, COUNT(*) as NumTransactions
WHERE NumTransactions > 100
RETURN Itemset, NumTransactions
It feels to me like you should not be storing the Itemset in an A node. Here is an example of how to create relationships from each Item to the same B node:
MATCH (t:Trans)-[:CONTAINS]->(i2:Item), (t)-[:CONTAINS]->(i1:Item), (t)-[:CONTAINS]->(i3:Item)
// keep the Item nodes themselves in scope so the MERGEs below attach to them
WITH i1, i2, i3, COUNT(*) AS c
WHERE c > 100
MERGE (b:B {numTransactions: c})
MERGE (i1)-[:FOO]->(b)
MERGE (i2)-[:FOO]->(b)
MERGE (i3)-[:FOO]->(b)

The second part looks easy enough:
Give the A nodes a distinct label.
MATCH (n:A_LABEL) WHERE n.b_property > 35 RETURN n
Iterate over the list, checking for a relationship from the first one to each of the rest, and creating one if it's not there.
Could the values change over time? If so, you may need a corresponding function that removes this kind of relationship from nodes whose score is now <= the threshold value.
It's harder to advise about how to store the collections, without knowing more about the problem you're solving.
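
As an illustration of the pairing step described above, here is a small Python sketch that takes the example A/B rows from the question, keeps the itemsets whose score exceeds the threshold, and lists every pair that would receive a relationship. The names scored_itemsets, THRESHOLD, and links are hypothetical, and the threshold of 35 simply mirrors the b_property filter in the query above.

```python
from itertools import combinations

# The example rows from the question: (itemset, score).
scored_itemsets = [
    (frozenset("abc"), 45),
    (frozenset("ef"), 23),
    (frozenset("aef"), 12),
    (frozenset("d"), 89),
]
THRESHOLD = 35  # assumed cut-off, same value as the b_property filter

# Keep only the itemsets whose score is above the threshold.
qualifying = [itemset for itemset, score in scored_itemsets if score > THRESHOLD]

# Every unordered pair of qualifying itemsets would get one relationship.
links = list(combinations(qualifying, 2))

for left, right in links:
    print(sorted(left), "<->", sorted(right))
```

With this data only {a,b,c} and {d} qualify, so a single relationship would be created. If scores change over time, recomputing links and comparing against the stored relationships shows which ones to remove.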

QUERY results with count, group, order functions

I have 5 columns of data I want to return from a query, plus a count of the first column.
A couple of other things: I only want to include listings that are active (marked by the tag "Include" in column M), and I want the data to be randomized (I do this with a random number generator in column P). Neither of these two columns should be displayed. The data I want returned is in columns Q, R, S, T, U.
My data looks like this:
M        N     O     P        Q         R     S       T        U
Active   Text  Text  RN       Phone#    ID    Name    Level    Location
Include  text  text  0.51423  10000001  1223  Bob     Level 2  Boston
Include  text  text  0.34342  10000005  2234  Dylan   Level 3  San Francisco
Exclude  text  text  0.56453  10000007  2311  Janet   Level 8  Des Moines
Include  text  text  0.23122  10000008  2312  Gina    Level 8  Houston
Include  text  text           10000001  1225  Ronda   Level 3  Boston
Include  text  text           10000001  1236  Nathan  Level 2  Boston
So, ideally, results would look like:
count Phone#  Phone#    ID    Name   Level    Location
3             10000001  1223  Bob    Level 2  Boston
1             10000005  2234  Dylan  Level 3  San Francisco
1             10000008  2312  Gina   Level 8  Houston
I don't care which ID or Name shows up next to the phone number, so long as it comes from one of the rows with that number.
I have been able to get the two parts (ORDER BY and COUNT) to work separately, but can't get both to work in one function:
Worked:
=QUERY(Function!M:U, "SELECT count (Q), Q where O = 'Include' group by Q")
=QUERY(Function!M:U, "SELECT Q, R, S, T, U where O = 'Include' ORDER BY P DESC")
Did not work:
=QUERY(Function!M:U, "SELECT count (Q), Q group by Q, R, S, T, U where O = 'Include' group by Q ORDER BY P DESC, R, S, T, U")
=QUERY(Function!M:U, "SELECT count (Q), Q, R, S, T, U group by Q where O = 'Include' group by Q ORDER BY P DESC")
=QUERY(Function!M:U, "SELECT count (Q), Q group by Q where O = 'Include' group by Q ORDER BY P DESC, R, S, T, U")
Maybe someone has an idea of where I'm going wrong with combining the two different types of syntax? Help is much appreciated! :)

=ARRAYFORMULA({"count Phone#", "Phone#", "ID", "Name", "Level", "Location";
QUERY(Function!M3:U,
"select count(Q),Q where P is not null group by Q label count(Q)''", 0),
IFERROR(VLOOKUP(INDEX(QUERY(Function!M3:U,
"select Q,count(Q) where P is not null group by Q label count(Q)''", 0),,1),
QUERY(Function!M3:U,
"select Q,R,S,T,U where P is not null order by P desc", 0), {2, 3, 4, 5}, 0))})
cell P2:
=ARRAYFORMULA({"RN"; IF(M3:M="Include", RANDBETWEEN(ROW(A3:A),99^9), )})
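
For readers who want to see the logic of the formula spelled out, here is a rough Python analogue using the sample rows from the question. It is a sketch of the idea only (filter active rows, shuffle, count per phone number, take the first row found for each number), not a description of how Sheets evaluates QUERY or VLOOKUP. The fixed seed is an assumption made purely so the sketch is repeatable.

```python
import random
from collections import Counter

# Rows from the question: (active, phone, id, name, level, location).
rows = [
    ("Include", 10000001, 1223, "Bob", "Level 2", "Boston"),
    ("Include", 10000005, 2234, "Dylan", "Level 3", "San Francisco"),
    ("Exclude", 10000007, 2311, "Janet", "Level 8", "Des Moines"),
    ("Include", 10000008, 2312, "Gina", "Level 8", "Houston"),
    ("Include", 10000001, 1225, "Ronda", "Level 3", "Boston"),
    ("Include", 10000001, 1236, "Nathan", "Level 2", "Boston"),
]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# Keep active rows and shuffle them (the role of the random column P).
active = [row[1:] for row in rows if row[0] == "Include"]
rng.shuffle(active)

# Count rows per phone number (the "count(Q) ... group by Q" part).
counts = Counter(row[0] for row in active)

# Like an exact-match VLOOKUP: the first shuffled row per phone number wins.
first_row = {}
for row in active:
    first_row.setdefault(row[0], row)

report = [(counts[phone], *first_row[phone]) for phone in sorted(counts)]

for line in report:
    print(line)
```

The counts are stable, while the ID and Name shown for a repeated phone number depend on the shuffle, which matches the requirement that any one of the matching rows may be displayed.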

Return multiple sums of relationship weights using cypher

I have a graph with one node type 'nodeName' and one relationship type 'relName'. Each node pair has 0-1 'relName' relationships with each other but each node can be connected to many nodes.
Given an initial list of nodes (I'll refer to this list as the query subset), I want to:
1. Find all the nodes that connect to the query subset.
I'm currently doing this (which may be overly convoluted):
MATCH (a: nodeName)-[r:relName]-()
WHERE (a.name IN ['query list'])
WITH a
MATCH (b: nodeName)-[r2:relName]-()
WHERE NOT (b.name IN ['query list'])
WITH a, b
MATCH (a)--(b)
RETURN DISTINCT b
2. Then, for each connected node (b), return the SUM of the weights of the edges that connect it to the query subset.
For example, if node b1 has 4 edges that connect to nodes in the query subset, I would like to RETURN SUM(r2.weight) AS totalWeight for b1. I actually need a list of all the b nodes ordered by totalWeight.
No. 2 is where I'm stuck. I've been reading the docs about FOREACH and reduce() but I'm not sure how to apply them here.
Speed is important, as I have 30,000 nodes and 1.5M edges. If you have any suggestions regarding this, please throw them into the mix.
Many thanks
Matt

Why do you need so many MATCH statements? You can specify the a nodes and the b nodes in a single MATCH and select only those that have a relationship between them.
After that, just return the b nodes and the sum of the weights. When b is returned alongside an aggregation function such as sum(), it automatically acts as the grouping key.
MATCH (a:nodeName)-[r:relName]-(b:nodeName)
WHERE (a.name IN ['query list']) AND NOT((b.name IN ['query list']))
RETURN b.name, sum(r.weight) as weightSum order by weightSum

I think we can simplify that query a bit.
MATCH (a: nodeName)
WHERE (a.name IN ['query list'])
WITH collect(a) as subset
UNWIND subset as a
MATCH (a)-[r:relName]-(b)
WHERE NOT b in subset
RETURN b, sum(r.weight) as totalWeight
ORDER BY totalWeight ASC
Since sum() is an aggregating function, it will make the non-aggregation variables the grouping key (in this case b), so the sum is per b node, then we order them (switch to DESC if needed).
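
The grouping behaviour described above can be illustrated with a short Python sketch: every edge with exactly one endpoint inside the query subset contributes its weight to the node on the outside. The edge list and node names are hypothetical sample data, and this only mirrors the logic of the query rather than its performance characteristics in Neo4j.

```python
from collections import defaultdict

# Hypothetical undirected weighted edges: (node, node, weight).
edges = [
    ("q1", "x", 2.0),
    ("x", "q2", 3.0),
    ("q1", "y", 1.5),
    ("q1", "q2", 9.0),  # both ends in the subset: ignored
    ("x", "y", 4.0),    # neither end in the subset: ignored
]
subset = {"q1", "q2"}

total_weight = defaultdict(float)
for u, v, w in edges:
    # Keep edges with exactly one endpoint inside the query subset.
    if (u in subset) != (v in subset):
        outside = v if u in subset else u
        total_weight[outside] += w  # b acts as the grouping key

# ORDER BY totalWeight ASC
ranked = sorted(total_weight.items(), key=lambda item: item[1])

for node, weight in ranked:
    print(node, weight)
```

A single pass over the edges with a set lookup per endpoint is the same shape of work as the single-MATCH queries above.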

Neo4j: How to fix one of the node in match clause?

For clauses like MATCH (a:Address)-[:BelongTo]->(w1:Wallet), (a)-[r0:BelongTo]->(w2:Wallet) WHERE ID(w1)>ID(w2) WITH w1, w2..., is it possible to make sure that, e.g., w1 is always a fixed node? If yes, is it possible to pick that node as, e.g., the one with the minimum value of a certain property among all the nodes that could be w1?
More concretely, suppose an address belongs to wallets a, b, c, with a > b > c in terms of ID. Then normally these rows will be returned:
w1 w2
--------
a b
b c
a c
I only want these two rows of result to be returned:
w1 w2
--------
a b
a c
Note: I want the query to get every pair of wallets that an address belongs to. All addresses that belong to two or more wallets should be included in the result if a is returned.
For example, if there are two addresses that each belong to three different wallets, what would the query you posted do?
More concretely, if addresses a1 and a2 belong to b1, c1, d1 and b2, c2, d2 respectively (with b1 > c1 > d1 > b2 > c2 > d2 in terms of id),
I want it to return:
a w1 w2
-----------
a1 b1 c1
a1 b1 d1
a2 b2 c2
a2 b2 d2
Is it possible?

Yes, you can do this by finding, for each a:Address, the :Wallet with the minimum id. After you match this :Wallet, you can match all the other :Wallets.
MATCH (a:Address)-[:BelongTo]->(w1:Wallet)
WITH a, min(id(w1)) as minId
// since we have the minId, we can do a fast lookup of the node
MATCH (minW:Wallet)
WHERE id(minW) = minId
// now get all the others
MATCH (a)-[:BelongTo]->(w2:Wallet)
WHERE minW <> w2
...
If you don't care how the fixed node is taken, and if it only matters for the duration of the query, it may be easier to collect all the :Wallet nodes, take the first node in the collection, and then UNWIND the rest into rows and continue the query:
MATCH (a:Address)-[:BelongTo]->(w:Wallet)
WITH a, collect(w) as wallets
WITH a, head(wallets) as w1, wallets
UNWIND tail(wallets) as w2
...
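
Both queries follow the same pattern: fix one wallet per address, then pair it with each of the remaining wallets. A minimal Python sketch of that pattern is shown below. The function name fixed_pairs, the choose parameter, and the numeric ids are hypothetical; the ids are chosen only to respect the ordering b1 > c1 > d1 > b2 > c2 > d2 from the question.

```python
def fixed_pairs(wallets_by_address, choose=min):
    """Pair one fixed wallet per address with each of its other wallets."""
    rows = []
    for address, wallet_ids in wallets_by_address.items():
        if len(wallet_ids) < 2:
            continue  # an address with a single wallet yields no pair
        fixed = choose(wallet_ids)
        rows.extend((address, fixed, other)
                    for other in wallet_ids if other != fixed)
    return rows


# Hypothetical ids: b1=6, c1=5, d1=4 for a1 and b2=3, c2=2, d2=1 for a2.
wallets_by_address = {"a1": [6, 5, 4], "a2": [3, 2, 1]}

print(fixed_pairs(wallets_by_address))              # fixed wallet = minimum id
print(fixed_pairs(wallets_by_address, choose=max))  # fixed wallet = maximum id
```

With choose=min the fixed wallet is the one with the smallest id, as in the first query. Passing choose=max instead fixes b1 and b2, which reproduces the table the question asks for.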
