Find the nodes with the most mutual connecting nodes? - neo4j

I'm working with a data set that contains customers, their purchases and the businesses they purchased from, and I'm trying to determine which businesses share the highest number of mutual customers. Ideally the output would be a table that lists the connected businesses and the number of mutual customers. I.e.:
| BUSINESS_1 - BUSINESS_2 | 4 |
| BUSINESS_1 - BUSINESS_5 | 3 |
| BUSINESS_3 - BUSINESS_7 | 2 |
| BUSINESS_4 - BUSINESS_9 | 2 |
I don't have much at this point, but the query I'm working with looks something like this:
MATCH (c:Customer)<-[:Trans_Cust]-(t:Transaction)-[:Trans_Business]->(b:Business)
RETURN c, t, b
Thanks in advance

This should do the trick. If it doesn't, please provide a sample dataset on http://console.neo4j.org so we can help.
MATCH (b:Business)
MATCH (b)<-[:Trans_Business]-(t:Transaction)-[:Trans_Cust]->(c:Customer)
MATCH (c)<-[:Trans_Cust]-(:Transaction)-[:Trans_Business]->(other:Business)
WHERE b <> other
WITH b, other, collect(DISTINCT c) AS customers
RETURN b, other, size(customers) as sharedCustomers
ORDER BY sharedCustomers DESC
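To check what the query above is meant to compute, here is a minimal in-memory sketch in Python. The sample transactions and the function name `shared_customers` are made up for illustration and are not from the question's data set.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: one (customer, business) pair per transaction.
transactions = [
    ("alice", "BUSINESS_1"), ("alice", "BUSINESS_2"),
    ("bob", "BUSINESS_1"), ("bob", "BUSINESS_2"),
    ("bob", "BUSINESS_2"),  # repeat purchase, must not be counted twice
    ("carol", "BUSINESS_1"), ("carol", "BUSINESS_5"),
]

def shared_customers(transactions):
    # Distinct businesses used by each customer.
    by_customer = defaultdict(set)
    for customer, business in transactions:
        by_customer[customer].add(business)
    # Each customer counts once for every pair of businesses they used.
    counts = defaultdict(int)
    for businesses in by_customer.values():
        for pair in combinations(sorted(businesses), 2):
            counts[pair] += 1
    # Highest number of shared customers first, ties broken by name.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

result = shared_customers(transactions)
for (b1, b2), n in result:
    print(f"| {b1} - {b2} | {n} |")
```

Note that this sketch treats each pair as unordered, so every pair is listed once. The Cypher query with `b <> other` returns each pair in both orders.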

Related

Query distinct property and return complete nodes

I have a lot of nodes, some of which share the same value in field X. For each distinct X value, I want to select the most popular node (ordered by another field, Y) and return it with all of its properties.
Example:
ID | X | Y | Name
1 | A | 100 | David
2 | A | 10 | Chris
3 | B | 5 | Brad
4 | B | 25 | Amber
Should return:
1 | A | 100 | David
4 | B | 25 | Amber
I managed to get the list by distinct X:
MATCH (u:NodeType)
RETURN DISTINCT u.X
I need to find the most popular (highest value of Y) nodes to join with my distinct nodes (which are now only a single property) and return whole nodes (with all the properties).
You are looking for an argmax-style query. I recently answered a similar problem using collect:
MATCH (u:NodeType)
WITH u
ORDER BY u.Y DESC
WITH u.X AS X, collect(u)[0] AS u
RETURN u
The idea is the following:
1. Order by the value of Y (descending).
2. Implicitly group by the values of X and, as the aggregating function, use collect to gather the other values into a list. The elements of the list are the nodes (which are still sorted in descending order of Y).
3. For each collected list, select the first element with [0].
Maybe the query is a bit easier to read if you perform the last step in a separate clause (and not in the WITH clause that performs the collect):
MATCH (u:NodeType)
WITH u
ORDER BY u.Y DESC
WITH u.X AS X, collect(u) AS us
RETURN us[0] AS u
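The same order, group, take-first idea can be sketched in plain Python using the example rows from the question. The function name `top_per_group` is hypothetical, and this only illustrates the logic, not how Neo4j executes the query.

```python
# Rows from the question's example table: (ID, X, Y, Name).
rows = [
    (1, "A", 100, "David"),
    (2, "A", 10, "Chris"),
    (3, "B", 5, "Brad"),
    (4, "B", 25, "Amber"),
]

def top_per_group(rows):
    # Step 1: order by Y, descending.
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    # Step 2: group by X, collecting rows in that order.
    groups = {}
    for row in ordered:
        groups.setdefault(row[1], []).append(row)
    # Step 3: take the first element of each collected list.
    return [members[0] for members in groups.values()]

best = sorted(top_per_group(rows))
for row in best:
    print(row)
```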

Getting Multiple results from different relationships with Cypher

I am sure this question has been asked but I can't find it.
I have a social graph and I want to be able to show people suggestions based on 3 different relationships in one result.
I have 3 different nodes (Skill, Interest, Title)
Each person has a relationship of SKILL_OF, INTEREST_OF, and IS_TITLED respectively.
I would like a single (unique, if possible) result set: match the person, then find people who have the same skills, interests, and job title.
I tried to start with 2 results (and then wanted to add title on after) but here is what I have.
MATCH (p:Person { username:'wkolcz' })-[INTEREST_OF]->(Interest)<-[i:INTEREST_OF]-(f:Person)
MATCH(p)-[SKILL_OF]->(s:Skill)<-[sk:SKILL_OF]-(sf:Person)
RETURN f.first_name,f.last_name, sf.first_name, sf.last_name, i, s
I tried to make the matching person the same variable but, as you experts know, that failed. I got a result set but it doesn't make sense to me how I could then display it.
I would like a single list of first_name, last_name, username from the 2, and bonus points if I could also get the matches returned (i and s) so I could display the matching results (This person also has skill(s) in X or This person also has interest in X).
Thanks and let me know!
[EDITED]
This turned out to be a very interesting problem.
I provide a solution that:
Only returns a single result row for every person.
Displays all the interests and skills shared by that person and wkolcz as separate collections. (I presume that people in the DB can have multiple interests and skills.)
The solution finds all the people with shared interests and/or skills in a single MATCH clause.
MATCH (p:Person { username:'wkolcz' })-[r1:INTEREST_OF|SKILL_OF]->(n)<-[r2:INTEREST_OF|SKILL_OF]-(f)
WHERE TYPE(r1) = TYPE(r2)
WITH f, COLLECT(TYPE(r1)) AS ts, COLLECT(n.name) AS names
RETURN f.first_name, f.last_name, f.username,
REDUCE(s = { interests: [], skills: []}, i IN RANGE(0, LENGTH(ts)-1) | CASE
WHEN ts[i] = "INTEREST_OF"
THEN { interests: s.interests + names[i], skills: s.skills }
ELSE { interests: s.interests, skills: s.skills + names[i]} END ) AS shared;
Here is a console that shows these sample results:
+---------------------------------------------------------------------------------------------+
| f.first_name | f.last_name | f.username | shared |
+---------------------------------------------------------------------------------------------+
| "Fred" | "Smith" | "fsmith" | {interests=[Bird Watching], skills=[]} |
| "Oscar" | "Grouch" | "ogrouch" | {interests=[Bird Watching, Politics], skills=[]} |
| "Wilma" | "Jones" | "wjones" | {interests=[Bird Watching], skills=[Woodworking]} |
+---------------------------------------------------------------------------------------------+
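The REDUCE step in the query above folds two parallel lists into one map. A rough Python equivalent, assuming hypothetical collected values for a single person, looks like this:

```python
from functools import reduce

# Hypothetical collected values for one person f, in matching order,
# standing in for the query's `ts` and `names` lists.
ts = ["INTEREST_OF", "SKILL_OF", "INTEREST_OF"]
names = ["Bird Watching", "Woodworking", "Politics"]

def split_shared(ts, names):
    def step(acc, i):
        # Append names[i] to the list that matches the relationship type.
        if ts[i] == "INTEREST_OF":
            return {"interests": acc["interests"] + [names[i]],
                    "skills": acc["skills"]}
        return {"interests": acc["interests"],
                "skills": acc["skills"] + [names[i]]}
    return reduce(step, range(len(ts)), {"interests": [], "skills": []})

shared = split_shared(ts, names)
print(shared)
```

The accumulator is rebuilt at every step rather than mutated, which mirrors how the Cypher CASE expression returns a new map each time.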

Does label order affect search time?

I'm using Neo4j 2.1.7. Recently I was experimenting with MATCH queries, searching for nodes with several labels, and I found that the queries
Match (p:A:B) return count(p) as number
and
Match (p:B:A) return count(p) as number
generally take different amounts of time, especially in cases where you have, for example, 2 million A nodes and 0 B nodes.
So does label order affect search time? Is this feature documented anywhere?
Neo4j internally maintains a label scan store. That's basically a lookup to quickly get all nodes carrying a defined label A.
When doing a query like
MATCH (n:A:B) return count(n)
labelscanstore is used to find all A nodes and then they're filtered if those nodes carry label B as well. If n(A) >> n(B) it's way more efficient to do MATCH (n:B:A) instead since you look up only a few B nodes and filter those for A.
You can use PROFILE MATCH (n:A:B) return count(n) to see the query plan. For Neo4j <= 2.1.x you'll see a different query plan depending on the order of the labels you've specified.
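The cost difference between the two orders can be illustrated with a small Python model of scan-then-filter. The data sizes and the function name `count_with_both` are assumptions for the sketch. The list scan here only stands in for the label scan store lookup, and the second returned number counts how many nodes had to be filtered.

```python
def count_with_both(nodes, scan_label, filter_label):
    """Return (matches, filtered_nodes) for a scan-then-filter plan."""
    # Label scan: fetch only the nodes carrying scan_label.
    scanned = [labels for labels in nodes if scan_label in labels]
    # Filter: one label check per scanned node.
    matches = sum(1 for labels in scanned if filter_label in labels)
    return matches, len(scanned)

# Hypothetical miniature data set: 1000 A nodes, 10 B nodes, 1 node with both.
nodes = [{"A"}] * 1000 + [{"B"}] * 10 + [{"A", "B"}]

a_first = count_with_both(nodes, "A", "B")  # like MATCH (n:A:B) in 2.1.x
b_first = count_with_both(nodes, "B", "A")  # like MATCH (n:B:A) in 2.1.x
print(a_first, b_first)
```

Both orders find the same single node, but starting from the rarer label means far fewer nodes go through the filter.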
Starting with Neo4j 2.2 (milestone M03 available as of writing this reply) there's a cost based Cypher optimizer. Now Cypher is aware of node statistics and they are used to optimize the query.
As an example I've used the following statements to create some test data:
create (:A:B);
with 1 as a foreach (x in range(0,1000000) | create (:A));
with 1 as a foreach (x in range(0,100) | create (:B));
We now have roughly 100 B nodes, roughly 1M A nodes, and 1 node carrying both labels. In 2.2 the two statements:
MATCH (n:B:A) return count(n)
MATCH (n:A:B) return count(n)
result in the exact same query plan (and therefore in the same execution speed):
+------------------+---------------+------+--------+-------------+---------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+------------------+---------------+------+--------+-------------+---------------+
| EagerAggregation | 3 | 1 | 0 | count(n) | |
| Filter | 12 | 1 | 12 | n | hasLabel(n:A) |
| NodeByLabelScan | 12 | 12 | 13 | n | :B |
+------------------+---------------+------+--------+-------------+---------------+
Since there are only few B nodes, it's cheaper to scan for B's and filter for A. Smart Cypher, isn't it ;-)

Express absence of edges in Cypher

How do I express the following in Cypher:
"Return all nodes with at least one incoming edge of type A and no outgoing edges".
Best Regards
You can use a pattern to exclude nodes from the result subset like this:
MATCH ()-[:A]->(n) WHERE NOT (n)-->() RETURN n
Try
MATCH (n)
WHERE ()-[:A]->(n) AND NOT (n)-->()
RETURN n
or
MATCH ()-[:A]->(n)
WHERE NOT (n)-->()
RETURN DISTINCT n
Edit
Pattern expressions can be used both for pattern matching and as predicates for filtering. If used in the MATCH clause, the paths that answer the pattern are included in the result. If used for filtering, in the WHERE clause, the pattern serves as a limiting condition on the paths that have previously been matched. The result is limited, not extended to include the filter condition. When a pattern is used as a predicate for filtering, the negation of that predicate is also a predicate that can be used as a filter condition. No path answers to the negation of a pattern (if there is such a thing) so negations of patterns cannot be used in the MATCH clause. The phrase
Return all nodes with at least one incoming edge of type A and no outgoing edges
involves two patterns on nodes n, namely any incoming relationship [:A] on n and any outgoing relationship on n. The second must be interpreted as a pattern for a predicate filter condition since it involves a negation, not any outgoing relationship on n. The first, however, can be interpreted either as a pattern to match along with n, or as another pattern predicate filter condition.
These two interpretations give rise to the two cypher queries above. The first query matches all nodes and uses both patterns to filter the result. The second matches the incoming relationship on n along with n and uses the second pattern to filter the results.
The first query will match every node only once before the filtering happens. It will therefore return one result item per node that meets the criteria. The second query will match the pattern any incoming relationship [:A] on n once for each path, i.e. once for each incoming relationship on n. It may therefore contain a node multiple times in the result, hence the DISTINCT keyword to remove duplicates.
If the items of interest are precisely the nodes, then using both patterns for predicates in the WHERE clause seems to me the correct interpretation. It is also more efficient since it needs to find only zero or one incoming [:A] on n to resolve the predicate. If the incoming relationships are also of interest, then some version of the second query is the right choice. One would need to bind the relationship and do something useful with it, such as return it.
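The difference between the two interpretations can be shown with a small in-memory sketch in Python. The graph below is made up for illustration: node n1 has two incoming A edges and no outgoing edges, so it shows why the second form needs DISTINCT.

```python
# Hypothetical graph: (source, relationship type, target).
edges = [
    ("x", "A", "n1"),
    ("y", "A", "n1"),   # n1: two incoming A edges, no outgoing edges
    ("x", "A", "n2"),
    ("n2", "B", "n3"),  # n2 has an outgoing edge, so it is excluded
]
nodes = ["x", "y", "n1", "n2", "n3"]

def has_outgoing(n):
    return any(src == n for src, _, _ in edges)

def has_incoming_a(n):
    return any(rel == "A" and dst == n for _, rel, dst in edges)

# First interpretation: visit every node once, filter with both predicates.
first = [n for n in nodes if has_incoming_a(n) and not has_outgoing(n)]

# Second interpretation: one row per incoming A edge, then filter.
second = [dst for _, rel, dst in edges if rel == "A" and not has_outgoing(dst)]
second_distinct = sorted(set(second))

print(first, second, second_distinct)
```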
Below are the execution plans for the two queries executed on a 'fresh' neo4j console.
First query:
----
Filter
|
+AllNodes
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
| Filter | 0 | 0 | | (nonEmpty(PathExpression((17)-[ UNNAMED18:A]->(n), true)) AND NOT(nonEmpty(PathExpression((n)-[ UNNAMED36]->(40), true)))) |
| AllNodes | 6 | 7 | n, n | |
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
Second query:
----
Distinct
|
+Filter
|
+TraversalMatcher
+------------------+------+--------+-------------+--------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+--------------------------------------------------------------+
| Distinct | 0 | 0 | | |
| Filter | 0 | 0 | | NOT(nonEmpty(PathExpression((n)-[ UNNAMED30]->(34), true))) |
| TraversalMatcher | 0 | 13 | | n, UNNAMED8, n |
+------------------+------+--------+-------------+--------------------------------------------------------------+

Understanding Cypher output

I have a graph like this:
(2)<-[0:CHILD]-(1)-[1:CHILD]->(3)
In words: Node 1,2 and 3 (all with names); Edges 0 and 1
I wrote the following Cypher query:
START nodes = node(1,2,3), relationship = relationship(0,1)
RETURN nodes, relationship
and got as a result:
==> +-----------------------------------------------+
==> | nodes | relationship |
==> +-----------------------------------------------+
==> | Node[1]{name->"Risikogruppe2"} | :CHILD[0] {} |
==> | Node[1]{name->"Risikogruppe2"} | :CHILD[1] {} |
==> | Node[2]{name->"Beruf 1"} | :CHILD[0] {} |
==> | Node[2]{name->"Beruf 1"} | :CHILD[1] {} |
==> | Node[3]{name->"Beruf 2"} | :CHILD[0] {} |
==> | Node[3]{name->"Beruf 2"} | :CHILD[1] {} |
==> +-----------------------------------------------+
==> 6 rows, 0 ms
Now my question:
Why did I get every node twice and every relationship three times? I just want to get each of them once.
Thanks for your time ^^
Cypher works much like SQL here. Declaring variables in the START clause is roughly like listing tables in a SQL FROM clause (from nodes, relationships). You get a cartesian product of all the possible values of the two because you have no MATCH or WHERE to filter them, so it's basically like:
select *
from nodes, relationships
Where you forgot to put the foreign key relationship between the tables.
In Cypher, you do this by doing a match, usually:
start n=node(1,2,3), r=relationship(0,1)
match (n)-[r]-(m) // find where the n nodes and the r relationships point (to m)
return *
But since you have no match, you get a cartesian product.
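The cartesian product described above can be reproduced outside the database. This is a minimal Python sketch using the node and relationship ids from the question. The `endpoints` mapping is read off the graph drawing in the question, and the variable names are illustrative.

```python
from itertools import product

nodes = [1, 2, 3]
relationships = [0, 1]
# (2)<-[0:CHILD]-(1)-[1:CHILD]->(3): relationship id -> (start node, end node)
endpoints = {0: (1, 2), 1: (1, 3)}

# No MATCH: every node is paired with every relationship.
unfiltered = list(product(nodes, relationships))

# With a match: keep only pairs where the node is an endpoint of the relationship.
matched = [(n, r) for n, r in unfiltered if n in endpoints[r]]

print(len(unfiltered), matched)
```

Three nodes times two relationships gives the six rows seen in the question. Adding the match condition leaves only the pairs that are actually connected.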
You should only see the nodes and relationships once, unless you do some matching.
Tried to reproduce your problem, but I haven't been able to.
http://tinyurl.com/cobd8oq
Is it possible for you to create a console.neo4j.org example of your problem?
Thanks,
Andrés
