I have the following Cypher query:
MATCH (country:Country { name: 'norway' }) <- [:LIVES_IN] - (person:Person)
WITH person
MATCH (skill:Skill { name: 'java' }) <- [:HAS_SKILL] - (person)
WITH person
OPTIONAL MATCH (skill:Skill { name: 'javascript' }) <- [rel:HAS_SKILL] - (person)
WITH person, CASE WHEN skill IS NOT NULL THEN 1 ELSE 0 END as matches
ORDER BY matches DESC
LIMIT 50
RETURN COLLECT(ID(person)) as personIDs
The query seems to perform worse as I add more nodes. Right now, with only 5000 Person nodes (a Person node can have multiple HAS_SKILL relationships to Skill nodes), it takes around 180 ms, but adding another 1000 Person nodes with relationships adds 30-40 ms. We are planning on having millions of Person nodes, so adding 40 ms for every 1000 Person nodes is a no-go.
I use parameters in my query instead of 'norway', 'java', 'javascript' in the above query. I have created indexes on :Country(name) and :Skill(name).
My goal with the query is to match every person who lives in a specified country (norway) and also has the skill 'java'. If the person also has the skill 'javascript', they should be ordered higher in the result.
How can I restructure the query to improve performance?
Edit:
There also seems to be an issue with the :Country nodes. If I switch out
MATCH (country:Country { name: 'norway' }) <- [:LIVES_IN] - (person:Person)
with
MATCH (city:City { name: 'vancouver' }) <- [:LIVES_IN] - (person:Person)
the query time drops to around 15-50 ms, depending on which city I query for. There is still a noticeable increase in query time when adding more nodes.
Edit 2:
It seems like the query time increases by a huge amount when there are a lot of rows in the first match clause. If I switch the query to match on Skill nodes first, the query time decreases substantially. The query is part of an API and is created dynamically, so I do not know which of the match clauses will return the fewest rows. There will probably also be a lot more rows in every match clause when the database grows.
Edit 3
I have done some testing from the answers and I now have the following query:
MATCH (country:Country { name: 'norway'})
WITH country
MATCH (country) <- [:LIVES_IN] - (person:Person)
WITH person
MATCH (person) - [:HAS_SKILL] -> (skill:Skill) WHERE skill.name = 'java'
MATCH (person) - [:MEMBER_OF_GROUP] -> (group:Group) WHERE group.name = 'some_group_name'
RETURN DISTINCT ID(person) as id
LIMIT 50
This still has performance issues. Would it be better to first match all the skills etc., as with the Country node? The query can also grow bigger: I may have to add matching against multiple skills, groups, projects etc.
Edit 4
I modified the query slightly and it seems like this did the trick. I now match all the needed skills, company, groups, country etc first. Then use those later in the query. In the profiler this reduced the database hits from 700k to 188 or something. It is a slightly different query from my original query (different labeled nodes etc), but it solves the same problem. I guess this can be further improved by maybe matching on the node with the least relationships first etc, to start with a reduced number of nodes. I'll do some more testing later!
MATCH (company:Company { name: 'relinkgroup' })
WITH company
MATCH (skill:Skill { name: 'java' })
WITH company, skill
MATCH (skill2:Skill { name: 'ajax' })
WITH company, skill, skill2
MATCH (country:Country { name: 'canada' })
WITH company, skill, skill2, country
MATCH (company) <- [:WORKED_AT] - (person:Person)
, (person) - [:HAS_SKILL] -> (skill)
, (person) - [:HAS_SKILL] -> (skill2)
, (person) - [:LIVES_IN] -> (country)
RETURN DISTINCT ID(person) as id
LIMIT 50
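The observation from Edit 2 and Edit 4, that starting from the filter with the fewest matches keeps the intermediate row count small, can be sketched outside the database. This is a minimal Python sketch, assuming each filter has already been resolved to a set of person IDs; the sets and the match_people helper are hypothetical and not part of Neo4j:

```python
def match_people(candidate_sets, limit=50):
    """Intersect candidate ID sets, starting from the smallest one.

    candidate_sets: list of sets of person IDs, one set per filter
    (hypothetical stand-ins for "lives in X", "has skill Y", ...).
    """
    if not candidate_sets:
        return []
    # Start with the most selective filter so every later step
    # only has to check a small number of candidates.
    ordered = sorted(candidate_sets, key=len)
    result = set(ordered[0])
    for ids in ordered[1:]:
        result &= ids
        if not result:
            break  # nothing can match any more, stop early
    return sorted(result)[:limit]


# Made-up sample data for illustration only.
lives_in_canada = set(range(0, 1000))
has_java = {3, 10, 250, 999, 4000}
has_ajax = {10, 250, 7000}
worked_at_company = {10, 42, 250}

people = match_people([lives_in_canada, has_java, has_ajax, worked_at_company])
print(people)  # [10, 250]
```

In the database the same idea would mean anchoring the pattern on whichever node has the fewest relationships, which PROFILE can help confirm.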
For the first line of your query, the execution has to look for all possible paths between the country and person. By limiting your initial match (thus defining a more accurate starting point for the traversal), you'll gain some performance.
So instead of
MATCH (country:Country { name: 'norway' }) <- [:LIVES_IN] - (person:Person)
Try doing it in two steps :
MATCH (country:Country { name: 'norway' })
WITH country
MATCH (country)<-[:LIVES_IN]-(person:Person)
WITH person
As an example, I'll use the simple movie app in the neo4j console : http://console.neo4j.org/
Doing a query equivalent to yours for finding people that know Cypher:
MATCH (n:Crew)-[r:KNOWS]-m WHERE n.name='Cypher' RETURN n, m
The execution plan will be :
Execution Plan
ColumnFilter
|
+Filter
|
+TraversalMatcher
+------------------+------+--------+-------------+----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+----------------------------------------+
| ColumnFilter | 2 | 0 | | keep columns n, m |
| Filter | 2 | 14 | | Property(n,name(0)) == { AUTOSTRING0} |
| TraversalMatcher | 7 | 16 | | m, r, m |
+------------------+------+--------+-------------+----------------------------------------+
Total database accesses: 30
And by defining an accurate starting point :
MATCH (n:Crew) WHERE n.name='Cypher' WITH n MATCH (n)-[:KNOWS]-(m) RETURN n,m
Result in the following execution plan :
Execution Plan
ColumnFilter
|
+SimplePatternMatcher
|
+Filter
|
+NodeByLabel
+----------------------+------+--------+-------------------+----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+-------------------+----------------------------------------+
| ColumnFilter | 2 | 0 | | keep columns n, m |
| SimplePatternMatcher | 2 | 0 | m, n, UNNAMED53 | |
| Filter | 1 | 8 | | Property(n,name(0)) == { AUTOSTRING0} |
| NodeByLabel | 4 | 5 | n, n | :Crew |
+----------------------+------+--------+-------------------+----------------------------------------+
Total database accesses: 13
As you can see, the first method uses the traversal pattern, which becomes exponentially more expensive as the number of nodes grows, and it does a global match on the graph.
The second uses an explicit starting point, found through the label index.
EDIT
For the skills part, I would do something like this. If you can provide some test data, it would be more helpful for testing:
MATCH (country:Country { name: 'norway' })
WITH country
MATCH (country)<-[:LIVES_IN]-(person:Person)-[:HAS_SKILL]->(skill:Skill)
WHERE skill.name = 'java'
WITH person
OPTIONAL MATCH (person)-[:HAS_SKILL]->(skillb:Skill) WHERE skillb.name = 'javascript'
WITH person, skillb
There is no need for global lookups: since the query has already found the persons, it just follows the HAS_SKILL relationships and filters on the skill.name value.
Edit 2:
Concerning your last edit, maybe this last part of the query :
MATCH (company) <- [:WORKED_AT] - (person:Person)
, (person) - [:HAS_SKILL] -> (skill)
, (person) - [:HAS_SKILL] -> (skill2)
, (person) - [:LIVES_IN] -> (country)
Could be better written as :
MATCH (person:Person)-[:WORKED_AT]->(company)
WHERE (person)-[:HAS_SKILL]->(skill)
AND (person)-[:HAS_SKILL]->(skill2)
AND (person)-[:LIVES_IN]->(country)
Related
In neo4j my database consists of chains of nodes. For each distinct structure/layout (does graph theory have a better word?), I want to count the number of chains. For example, the database consists of 9 nodes and 5 relationships like this:
(:a)->(:b)
(:b)->(:a)
(:a)->(:b)
(:a)->(:b)->(:b)
where (:a) is a node with label a. Properties on nodes and relationships are irrelevant.
The result of the counting should be:
------------------------
| Structure | n |
------------------------
| (:a)->(:b) | 2 |
| (:b)->(:a) | 1 |
| (:a)->(:b)->(:b) | 1 |
------------------------
Is there a query that can achieve this?
Appendix
Query to create test data:
create (:a)-[:r]->(:b), (:b)-[:r]->(:a), (:a)-[:r]->(:b), (:a)-[:r]->(:b)-[:r]->(:b)
EDIT:
Thanks for the clarification.
We can get the equivalent of what you want, a capture of the path pattern using the labels present:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
RETURN [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
This will give you a list of the labels of the nodes (the first label present for each...remember that nodes can be multi-labeled, which may throw off your results).
As for getting it into that exact format in your example, that's a different thing. We could do this with some text functions in APOC Procedures, specifically apoc.text.join().
We would need to first add formatting around the extraction of the first label to add the prefixed : as well as the parentheses. Then we could use apoc.text.join() to get a string where the nodes are joined by your desired '->' symbol:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
WITH [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
RETURN apoc.text.join([label in structure | '(:' + label + ')'], '->') as structure, n
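To make the counting logic concrete, here is a small Python sketch of what the query computes on the test data from the question. It is only an in-memory model, assuming simple chains where every node has at most one incoming and one outgoing relationship; the node IDs and the count_chain_structures helper are made up for illustration:

```python
from collections import Counter


def count_chain_structures(labels, edges):
    """Count chains by their label sequence.

    labels: dict mapping node id -> label
    edges:  list of (source, target) pairs
    Assumes simple chains: at most one incoming and one outgoing
    relationship per node.
    """
    next_node = dict(edges)
    has_incoming = {target for _, target in edges}
    counts = Counter()
    for node in labels:
        # Same role as "NOT ()-->(start)"; isolated nodes are skipped
        # because a path needs at least one relationship.
        if node in has_incoming or node not in next_node:
            continue
        structure = [labels[node]]
        # Walk until the end of the chain, like "NOT (end)-->()".
        while node in next_node:
            node = next_node[node]
            structure.append(labels[node])
        counts["->".join(f"(:{label})" for label in structure)] += 1
    return counts


# The 9 nodes and 5 relationships from the question.
labels = {1: "a", 2: "b", 3: "b", 4: "a", 5: "a", 6: "b", 7: "a", 8: "b", 9: "b"}
edges = [(1, 2), (3, 4), (5, 6), (7, 8), (8, 9)]

result = count_chain_structures(labels, edges)
for structure, n in result.items():
    print(structure, n)
```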
I am sure this question has been asked but I can't find it.
I have a social graph and I want to be able to show people suggestions based on 3 different relationships in one result.
I have 3 different nodes (Skill, Interest, Title)
Each person has a relationship of SKILL_OF, INTEREST_OF, and IS_TITLED respectively.
I would like to have a single (unique if possible) result set that matches the person and then finds people who have the same skills, interests, and job title.
I tried to start with 2 results (and then wanted to add title on after) but here is what I have.
MATCH (p:Person { username:'wkolcz' })-[INTEREST_OF]->(Interest)<-[i:INTEREST_OF]-(f:Person)
MATCH(p)-[SKILL_OF]->(s:Skill)<-[sk:SKILL_OF]-(sf:Person)
RETURN f.first_name,f.last_name, sf.first_name, sf.last_name, i, s
I tried to make the matching person the same variable but, as you experts know, that failed. I got a result set but it doesn't make sense to me how I could then display it.
I would like a single list of first_name, last_name, username from the 2, and bonus points if I could get the matches also returned (i and s) so I could display the matching results (This person also has skill(s) in X or This person also has interest in X).
Thanks and let me know!
[EDITED]
This turned out to be a very interesting problem.
I provide a solution that:
Only returns a single result row for every person.
Displays all the interests and skills shared by that person and wkolcz as separate collections. (I presume that people in the DB can have multiple interests and skills.)
The solution finds all the people with shared interests and/or skills in a single MATCH clause.
MATCH (p:Person { username:'wkolcz' })-[r1:INTEREST_OF|SKILL_OF]->(n)<-[r2:INTEREST_OF|SKILL_OF]-(f)
WHERE TYPE(r1) = TYPE(r2)
WITH f, COLLECT(TYPE(r1)) AS ts, COLLECT(n.name) AS names
RETURN f.first_name, f.last_name, f.username,
REDUCE(s = { interests: [], skills: []}, i IN RANGE(0, LENGTH(ts)-1) | CASE
WHEN ts[i] = "INTEREST_OF"
THEN { interests: s.interests + names[i], skills: s.skills }
ELSE { interests: s.interests, skills: s.skills + names[i]} END ) AS shared;
Here is a console that shows these sample results:
+---------------------------------------------------------------------------------------------+
| f.first_name | f.last_name | f.username | shared |
+---------------------------------------------------------------------------------------------+
| "Fred" | "Smith" | "fsmith" | {interests=[Bird Watching], skills=[]} |
| "Oscar" | "Grouch" | "ogrouch" | {interests=[Bird Watching, Politics], skills=[]} |
| "Wilma" | "Jones" | "wjones" | {interests=[Bird Watching], skills=[Woodworking]} |
+---------------------------------------------------------------------------------------------+
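For readers who find the REDUCE/CASE expression hard to follow, the same grouping step can be sketched in Python. This is only an illustration of the logic, assuming the MATCH has already produced one (username, relationship type, name) row per shared node; the rows and the group_shared helper below are hypothetical:

```python
def group_shared(rows):
    """Group shared node names per person, split by relationship type.

    rows: iterable of (username, relationship_type, name) tuples,
    one per node shared with the starting person.
    """
    shared = {}
    for username, rel_type, name in rows:
        entry = shared.setdefault(username, {"interests": [], "skills": []})
        # Mirrors the CASE: INTEREST_OF goes to interests, anything else to skills.
        key = "interests" if rel_type == "INTEREST_OF" else "skills"
        entry[key].append(name)
    return shared


# Made-up rows that correspond to the sample results above.
rows = [
    ("fsmith", "INTEREST_OF", "Bird Watching"),
    ("ogrouch", "INTEREST_OF", "Bird Watching"),
    ("ogrouch", "INTEREST_OF", "Politics"),
    ("wjones", "INTEREST_OF", "Bird Watching"),
    ("wjones", "SKILL_OF", "Woodworking"),
]

shared = group_shared(rows)
for username, groups in shared.items():
    print(username, groups)
```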
I want to calculate a "corporation index" for a particular user based on how many times a user viewed or updated a file. In order to get this I'll assign values to certain paths. This is an example:
(u1:User {name: 'Alice'})-[:UPDATED]->(f:File)<-[:VIEWED]-(u2:User) // is worth 0.2 points
(u1:User {name: 'Alice'})-[:VIEWED]->(f:File)<-[:VIEWED]-(u2:User) // is worth 0.1 points
(u1:User {name: 'Alice'})-[:VIEWED]->(f:File)<-[:UPDATED]-(u2:User) // is worth 0.2 points
(u1:User {name: 'Alice'})-[:UPDATED]->(f:File)<-[:UPDATED]-(u2:User) // is worth 0.5 points
The image shows a possible graph.
I want to know what a query would look like that returns the following results.
User: Alice, User: Charly, index: (3 * 0.2) // 3 because there are 3 matching paths (Relationship with the lowest weight in the path)
User: Alice, User: Bob, index: (3 * 0.1)
This is what I have so far:
MATCH (u1:User {name:'Alice'})-[r1:VIEWED]->(f:File)<-[r2:UPDATED]-(u2:User)
OPTIONAL MATCH (u1:User {name:'Alice'})-[r3:VIEWED]->(f:File)<-[r4:VIEWED]-(u3:User)
RETURN u2.name, min(r1.weight) AS ViewUpd, u3.name, min(r3.weight) AS ViewView
This query doesn't work at all, but I hope it clarifies what I want.
[EDITED]
Does this query do what you want?
MATCH (u1:User { name:'Alice' })-[r1]->(f:File)<-[r2]-(u2:User)
RETURN
u2.name AS name,
SUM(
r1.weight * (CASE
WHEN (TYPE(r1)= "VIEWED" AND TYPE(r2)= "VIEWED") THEN 0.1
WHEN (TYPE(r1)= "UPDATED" AND TYPE(r2)= "UPDATED") THEN 0.5
ELSE 0.2
END)) AS index;
Here is a console showing this query, and here is a sample result:
+--------------------------------+
| name | index |
+--------------------------------+
| "Bob" | 0.6000000000000001 |
| "Elvis" | 1.5 |
| "David" | 0.6000000000000001 |
| "Charley" | 2.1 |
+--------------------------------+
In my sample data, "Charley" has an UPDATED relationship and a VIEWED relationship with the File that is UPDATED by "Alice". The resulting index for Charley is the sum of the index values for both of those relationships.
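The arithmetic behind the SUM/CASE expression can be checked with a short Python sketch. It assumes each matched path has been reduced to a tuple of (other user, type of Alice's relationship, weight of Alice's relationship, type of the other user's relationship); the sample paths, the weight of 1.0, and the helper names are assumptions made for illustration:

```python
from collections import defaultdict


def pair_factor(alice_rel_type, other_rel_type):
    """Points for one path, following the CASE expression in the query."""
    if alice_rel_type == "VIEWED" and other_rel_type == "VIEWED":
        return 0.1
    if alice_rel_type == "UPDATED" and other_rel_type == "UPDATED":
        return 0.5
    return 0.2  # one VIEWED and one UPDATED, in either order


def user_index(paths):
    """Sum weight * factor per other user, like SUM(...) grouped by u2.name."""
    totals = defaultdict(float)
    for other_user, alice_rel_type, alice_rel_weight, other_rel_type in paths:
        totals[other_user] += alice_rel_weight * pair_factor(alice_rel_type, other_rel_type)
    return dict(totals)


# Hypothetical paths: three VIEWED/VIEWED paths to Bob and three
# VIEWED/UPDATED paths to Charly, all with an assumed weight of 1.0.
paths = [("Bob", "VIEWED", 1.0, "VIEWED")] * 3 + [("Charly", "VIEWED", 1.0, "UPDATED")] * 3

totals = user_index(paths)
print({name: round(value, 2) for name, value in totals.items()})
```

With a weight of 1.0 on every relationship this gives 3 * 0.1 for Bob and 3 * 0.2 for Charly, up to floating-point rounding.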
I'm using neo4j 2.1.7. Recently I was experimenting with MATCH queries, searching for nodes with several labels, and I found that, in general, the query
Match (p:A:B) return count(p) as number
and
Match (p:B:A) return count(p) as number
take different amounts of time, especially in cases where you have, for example, 2 million A nodes and 0 B nodes.
So does the order of labels affect search time? Is this feature documented anywhere?
Neo4j internally maintains a label scan store, which is basically a lookup to quickly get all nodes carrying a defined label A.
When doing a query like
MATCH (n:A:B) return count(n)
the label scan store is used to find all A nodes, which are then filtered on whether they carry label B as well. If n(A) >> n(B), it's far more efficient to do MATCH (n:B:A) instead, since you look up only a few B nodes and filter those for A.
You can use PROFILE MATCH (n:A:B) return count(n) to see the query plan. For Neo4j <= 2.1.x you'll see a different query plan depending on the order of the labels you've specified.
Starting with Neo4j 2.2 (milestone M03 available as of writing this reply) there's a cost based Cypher optimizer. Now Cypher is aware of node statistics and they are used to optimize the query.
As an example I've used the following statements to create some test data:
create (:A:B);
with 1 as a foreach (x in range(0,1000000) | create (:A));
with 1 as a foreach (x in range(0,100) | create (:B));
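We now have roughly 100 B nodes, roughly 1M A nodes, and 1 node carrying both labels (range() is inclusive, so each FOREACH creates one more node than its upper bound). In 2.2 the two statements: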
MATCH (n:B:A) return count(n)
MATCH (n:A:B) return count(n)
result in the exact same query plan (and therefore in the same execution speed):
+------------------+---------------+------+--------+-------------+---------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+------------------+---------------+------+--------+-------------+---------------+
| EagerAggregation | 3 | 1 | 0 | count(n) | |
| Filter | 12 | 1 | 12 | n | hasLabel(n:A) |
| NodeByLabelScan | 12 | 12 | 13 | n | :B |
+------------------+---------------+------+--------+-------------+---------------+
Since there are only few B nodes, it's cheaper to scan for B's and filter for A. Smart Cypher, isn't it ;-)
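The cost difference between the two scan orders can be illustrated with a small Python model. This is only a sketch of the scan-then-filter idea, not Neo4j's actual implementation; the node counts are scaled down and the count_with_both_labels helper is hypothetical:

```python
def count_with_both_labels(nodes_by_label, labels_of, scan_label, filter_label):
    """Scan every node with scan_label, then filter on filter_label.

    Returns (matching_count, nodes_checked) so the work done can be compared.
    """
    checked = 0
    count = 0
    for node in nodes_by_label[scan_label]:
        checked += 1
        if filter_label in labels_of[node]:
            count += 1
    return count, checked


# Scaled-down made-up data: 1000 A-only nodes, 10 B-only nodes, 1 node with both.
labels_of = {}
for node_id in range(1000):
    labels_of[node_id] = {"A"}
for node_id in range(1000, 1010):
    labels_of[node_id] = {"B"}
labels_of[1010] = {"A", "B"}

# A simple stand-in for the label scan store: label -> list of node ids.
nodes_by_label = {"A": [], "B": []}
for node_id, labels in labels_of.items():
    for label in labels:
        nodes_by_label[label].append(node_id)

a_first = count_with_both_labels(nodes_by_label, labels_of, "A", "B")
b_first = count_with_both_labels(nodes_by_label, labels_of, "B", "A")
print(a_first)  # (1, 1001)
print(b_first)  # (1, 11)
```

Both orders return the same count, but scanning the rarer label first touches far fewer nodes, which is the choice a cost-based planner can make from node statistics.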
I have this json:
{
"name":"david", //node:Person
"TAKING":[ //link
{"name":"math"}, //node:Subject
{"name":"physics"} //node:Subject
],
"FRIEND":[ //link
{"name":"andres"}, //node:Person
{"name":"luis"} //node:Person
]
}
And I have this query to extract it from neo4j
start person=node(*) match person-[:FRIEND]->friend, person-[:TAKING]->subject where person.name="Andres" return person, collect(distinct friend), collect(distinct subject);
The result is this:
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | person | collect(distinct friend) | collect(distinct subject) |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | Node[1]{name:"Andres",title:"Developer"} | [Node[2]{name:"David",title:"Developer"},Node[3]{name:"Luis",title:"Developer"}] | [Node[5]{name:"math"},Node[6]{name:"physics"}] |
==> +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I think this part of the query can be better:
person-[:FRIEND]->friend, person-[:TAKING]->subject
The goal is to avoid the DISTINCT keyword in the return part:
collect(distinct friend), collect(distinct subject)
I rewrote it as:
subject<-[:TAKING]-person-[:FRIEND]->friend
but I get the same result.
Is there a better way to write this query, and is there a way to build the original json with Cypher?
Try the following query as demonstrated in http://console.neo4j.org/?id=mlwmlt to avoid the DISTINCT keyword:
START person=node(*)
WHERE HAS (person.name) AND person.name='A'
WITH person
MATCH (subject)<-[:TAKING]-(person)
WITH person, COLLECT(subject) AS subjects
MATCH (person)-[:FRIEND]->(friend)
RETURN person, subjects, COLLECT(friend)
But in general you should not use node(*). A good idea would be using an index of names instead.
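The reason the original query needs DISTINCT, while the staged version does not, is that matching both patterns in one MATCH produces one row per (friend, subject) combination. A minimal Python sketch of that effect, using the names from the result table above purely as sample data:

```python
from itertools import product

friends = ["David", "Luis"]
subjects = ["math", "physics"]

# Matching both patterns at once yields one row per combination,
# so each friend shows up once per subject (and vice versa).
rows = list(product(friends, subjects))
collected_friends = [friend for friend, _ in rows]
collected_subjects = [subject for _, subject in rows]
print(collected_friends)   # ['David', 'David', 'Luis', 'Luis']
print(collected_subjects)  # ['math', 'physics', 'math', 'physics']

# Collecting after each pattern separately (as the WITH ... COLLECT
# steps do) never builds the combinations, so no duplicates appear.
staged_subjects = list(subjects)
staged_friends = list(friends)
print(staged_friends, staged_subjects)
```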