Is there anything like a "do while" match pattern that satisfy an aggregated value ? (propeties etc) - neo4j

I dont know if this make sense using Cypher or graph traversal, but i was trying to do sort of a "shortest path" query but not based on weighted relationship but rather aggregated properties.
Assume i have nodes labeled People and they all vists different homepages with a VISIT relationship to the homepage node. Each homepage node has hits stats depending on its popularity. Now i would like to match people that has a visit relationship to a homepage until i reach max X number of exposure (hits).
Why ? Becuase then i know a "expected" exposure strategy for a certain group of people.
Something like
Do
MATCH (n:People)-[:VISITS]-(sites)
while (reduce (x)<100000)
Of course this "Do while" is nothing i have seen in the Cypher syntax but wouldn't it be useful? or should this be on app level by just returning a DESC list and do the math on in the applicaton. Mabey it should also be matched with some case if the loop cant be satisfied.
MATCH (n:People)-[:VISITS]-sites
WITH reduce(hits=0, x IN collect(sites.dailyhits)| hits + x) AS totalhits
RETURN totalhits;
Can return the correct aggregated hits value (all), but i would like this function to run each matched pattern until it satisfy a value and the return the match (of course i miss other possible and mixes between pages becuase the match never traversal the entire graph..but at least i have got an answer of pages in a list that match the requirement if it makes sense) ?
Thanks!

Not sure how you'd aggregate, but there are several aggregation functions (avg, sum, etc). And... you can pass these to a 2nd part of the cypher query, with a WITH clause.
That said: Cypher also supports the ability to sort a result (ORDER BY), and the ability to limit the number of results given (LIMIT). I don't know what you'd sort by, but... just for fun, let's sort it arbitrarily on something:
MATCH (n:People)-[v:VISITS]->(site:Site)
WHERE site.url= "http://somename.com"
RETURN n
ORDER BY v.VisitCount DESC
LIMIT 1000
This would cap your return set at 1,000 people, for people who visit a given site.

Related

Cypher: retrieve all attached nodes of more than one type?

Beginner Cypher question. I know how to get all the nodes of a particular type attached to a particular person in my database. Here I am retrieving all the friends of a particular person, within 10 hops:
MATCH (rebecca:Person {name:"Rebecca"})-[r*1..10]->(friends:Friend)
RETURN rebecca, friends
But how would I extend this to get nodes of two types: either the friends, or the neighbours, of Rebecca?
You can filter on the label of the friends identifier :
MATCH (rebecca:Person {name:"Rebecca"})-[r*1..10]->(other)
WHERE ALL( x IN ["Friend","Neighbour"] WHERE x IN labels(other) )
RETURN rebecca, other
NB: The answer from InverseFalcon is perfectly valid, here it is just another way to do this filter.
Note that this is not really ideal, FRIEND and NEIGHBOUR are semantically best described as relationships and you can see here that when
going away from the natural way of thinking as a graph (relationships matters!) you suffer from it in your queries.
There isn't an OR we can use on the label in the MATCH itself, so you may have to filter with a WHERE clause:
MATCH (rebecca:Person {name:"Rebecca"})-[r*1..10]->(friendOrNeighbor)
WHERE friendOrNeighbor:Friend or friendOrNeighbor:Neighbor
RETURN DISTINCT rebecca, friendOrNeighbor
Keep in mind variable-length relationship matches like this are meant to find all possible paths up to the given max limit, so this is actually doing extra work that you may not need, that may be slow if there are many relationships within that local graph.
You may want to consider apoc.path.expandConfig() from APOC Procedures. If you use 'NODE_GLOBAL' for uniqueness, and specify the upper bound with maxLevel: 10, it's a much more efficient means of getting the nodes you want faster.

Cypher directionless query not returning all expected paths

I have a cypher query that starts from a machine node, and tries to find nodes related to it using any of the relationship types I've specified:
match p1=(n:machine)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*]-(n2)
where n.machine="112943691278177215"
optional match p2=(n2)-[*]->()
return p1,p2
limit 300
The optional match clause is my attempt to traverse outwards in my model from each of the nodes found in p1. The below screenshot shows the part of the results I'm having issues with:
You can see from the starting machine node, it finds a personal_phone node via two app nodes related to the machine. For clarification, this part of the model is designed like so:
So it appeared to be working until I realized that certain paths were somehow being left out of the results. If I run a second query showing me all apps related to that particular personal_phone node, I get the following:
match p1=(n:personal_phone)<-[*]-(n2)
where n.personal_phone="(xxx) xxx-xxxx"
return p1
limit 100
The two apps I have segmented out, are the two apps shown in the earlier image.
So why doesn't my original query show the other 7 apps related to the personal_phone?
EDIT : Despite the overly broad optional match combined with the limit 300 statement, the returned results show only 52 nodes and 154 rels. This is because the paths following relationships with an outward direction are going to stop very quickly. I could have put a max 2 on it but was being lazy.
EDIT 2: The query I finally came up with to give me what I want is this:
match p1=(m:machine)<-[:MACHINE]-(a:app)
where m.machine="112943691278177215"
optional match p2=(a:app)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*0..3]-(n)
where a<>n and a<>m and m<>n
optional match p3=(n)-[r*]->(n2)
where n2<>n
return distinct n, r, n2
This returns 74 nodes and 220 rels which seems to be the correct result (387 rows). So it seems like my incredibly inefficient query was the reason the graph was being truncated. Not only were the nodes being traversed many times, but the paths being returned contained duplicate information which consumed the limited rows available for return. I guess my new questions are:
When following multiple hops, should I always explicitly make sure the same nodes aren't traversed via where clauses?
If I was to return p3 instead, it returns 1941 rows to display 74 nodes and 220 rels. There seems to be a lot of duplication present. Is it typically better to use return distinct (like I have above) or is there a way to easily dedupe the nodes and relationships within a path?
So part of your issue here (updated questions) is that you're returning paths, and not individual nodes/relationships.
For example, if you do MATCH p=(n)-[*]-() and your data is A->B->C->D then the results you'll get will be A->B, A->B->C, A->B->C->D and so on. If on the other hand you did MATCH (n)-[r:*]-(m) and then worked with r and m, you could get the same data, but deal with the distinct things on the path rather than have to sort that out later.
It seems you want the nodes and relationships, but you're asking for the paths - so you're getting them. ALL of them. :)
When following multiple hops, should I always explicitly make sure the
same nodes aren't traversed via where clauses?
Well, the way you did it, yes -- but honestly I haven't ever run into that problem before. Part of the issue again is the overly-broad query you're running. Lacking any constraint, it ends up roping in the items you've already matched, which buys you this problem. Perhaps better would be to match some set of possible labels, to narrow your query down. By narrowing it down, you wouldn't have the same issue, for example something like:
MATCH (n)-[r:*]-(m)
WHERE 'foo' in labels(m) or 'bar' in labels(m)
RETURN n, r, m;
Note we're not doing path matching, and we're specifying some range of labels that could be m, without leaving it completely wild-west. I tend to formulate queries this way, so your question #2 never really arises. Presumably you have a reasonable data model that would act as your grounding for that.

What with clause do? Neo4j

I don't understand what WITH clause do in Neo4j. I read the The Neo4j Manual v2.2.2 but it is not quite clear about WITH clause. There are not many examples. For example I have the following graph where the blue nodes are football teams and the yellow ones are their stadiums.
I want to find stadiums where two or more teams play. I found that query and it works.
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with a, count(*) as foaf
where foaf > 1
return a
count(*) says us the numbers of matching rows. But I don't understand what WITH clause do.
WITH allows you to pass on data from one part of the query to the next. Whatever you list in WITH will be available in the next query part.
You can use aggregation, SKIP, LIMIT, ORDER BY with WITH much like in RETURN.
The only difference is that your expressions have to get an alias with AS alias to be able to access them in later query parts.
That means you can chain query parts where one computes some data and the next query part can use that computed data. In your case it is what GROUP BY and HAVING would be in SQL but WITH is much more powerful than that.
here is another example
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with distinct a
order by a.name limit 10
match (a)-[:IN_CITY]->(c:City)
return c.name

Neo4j Cypher - Vary traversal depth conditional on number of nodes

I have a Neo4j database (version 2.0.0) containing words and their etymological relationships with other words. I am currently able to create "word networks" by traversing these word origins, using a variable depth Cypher query.
For client-side performance reasons (these networks are visualized in JavaScript), and because the number of relationships varies significantly from one word to the next, I would like to be able to make the depth traversal conditional on the number of nodes. My query currently looks something like this:
start a=node(id)
match p=(a)-[r:ORIGIN_OF*1..5]-(b)
where not b-->()
return nodes(p)
Going to a depth of 5 usually yields very interesting results, but at times delivers far too many nodes for my client-side visualization to handle. I'd like to check against, for example, sum(length(nodes(p))) and decrement the depth if that result exceeds a particular maximum value. Or, of course, any other way of achieving this goal.
I have experimented with adding a WHERE clause to the path traversal, but this is specific to individual paths and does not allow me to sum() the total number of nodes.
Thanks in advance!
What you're looking to do isn't fairly straight forward in a single query. Assuming you are using labels and indexing on the word property, the following query should do what you want.
MATCH p=(a:Word { word: "Feet" })-[r:ORIGIN_OF*1..5]-(b)
WHERE NOT (b)-->()
WITH reduce(pathArr =[], word IN nodes(p)| pathArr + word.word) AS wordArr
MATCH (words:Word)
WHERE words.word IN wordArr
WITH DISTINCT words
MATCH (origin:Word { word: "Feet" })
MATCH p=shortestPath((words)-[*]-(origin))
WITH words, length(nodes(p)) AS distance
RETURN words
ORDER BY distance
LIMIT 100
I should mention that this most likely won't scale to huge datasets. It will most likely take a few seconds to complete if there are 1000+ paths extending from your origin word.
The query basically does a radial distance operation by collecting all distinct nodes from your paths into a word array. Then it measures the shortest path distance from each distinct word to the origin word and orders by the closest distance and imposes a maximum limit of results, for example 100.
Give it a try and see how it performs in your dataset. Make sure to index on the word property and to apply the Word label to your applicable word nodes.
what comes to my mind is an stupid optimalization of graph:
what you need to do is to ad an information into each node, which will show up how many connections it has for each depth from 1 to 5, ie:
start a=node(id)
match (a)-[r:ORIGIN_OF*1..1]-(b)
with count(*) as cnt
set a.reach1 = cnt
...
start a=node(id)
match (a)-[r:ORIGIN_OF*5..5]-(b)
where not b-->()
with count(*) as cnt
set a.reach5 = cnt
then, before each run of your question query above, check if the number of reachX < you_wished_results and run the query with [r:ORIGIN_OF*X..X]
this would have some consequences - either you would have to run this optimalisation each time after new items or updates happens to your db, or after each new node /updated node you must add the reachX param to the update

Neo4j crashes on 4th degree Cypher query

neo4j noob here, on Neo4j 2.0.0 Community
I've got a graph database of 24,000 movies and 2700 users, and somewhere around 60,000 LIKE relationships between a user and a movie.
Let's say that I've got a specific movie (movie1) in mind.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)
RETURN usersLikingMovie1;
I can quickly and easily find the users who liked the movie with the above query. I can follow this path further to get the users who liked the same movies that as the people who liked movie1. I call these generation 2 users
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
RETURN usersGen2;
This query takes about 3 seconds and returns 1896 users.
Now I take this query one step further to get the movies liked by the users above (generation 2 movies)
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
This query causes neo4j to spin for several minutes at 100% cpu utilization and using 4GB of RAM. Then it sends back an exception "OutOfMemoryError: GC overhead limit exceeded".
I was hoping someone could help me out and explain to me the issue.
Is Neo4j not meant to handle a query of this depth in a performant manner?
Is there something wrong with my Cypher query?
Thanks for taking the time to read.
That's a pretty intense query, and the deeper you go the closer you're probably getting to a set of all users that ever rated any movie, since you're essentially just expanding out through the graph in tree form starting with your given movie. #Huston's WHERE and DISTINCT clauses will help to prune branches you've already seen, but you're still just expanding out through the tree.
The branching factor of your tree can be estimated with two values:
u, the average number of users that liked a movie (incoming to :Movie)
m, the average number of movies that each user liked (outgoing from :User)
For an estimate, your first step will return m users. On the next step, for each user you get all the movies each of them liked followed by all the users that liked all of those movies:
gen(1) => u
gen(2) => u * (m * u)
For each generation you'll tack on another m*u, so your third generation is:
gen(3) => u * (m * u) * (m * u)
Or more generically:
gen(n) => u^n * m^(n-1)
You could estimate your branching factors by computing the average of your likes/users and likes/movie, but that's probably very inaccurate since it gives you 22.2 likes/user and 2.5 likes/movie. Those numbers aren't reasonable for any movie that's worthy of rating. A better approach would be to take the median number of ratings or look at a histogram and use the peaks as your branching factors.
To put this in perspective, the average Netflix user rated 200 movies. The Netflix Prize training set had 17,770 movies, 480,189 users, and 100,480,507 ratings. That's 209 ratings/user and 5654 ratings/movie.
To keep things simple (and assuming your data set is much smaller), let's use:
m = 20 movie ratings/user
u = 100 users have rated/movie
Your query in gen-3 (without distincts) will return:
gen(3) = 100^3 * 20^2
= 400,000,000
400 million nodes (users)
Since you only have 2700 users, I think it's safe to say your query probably returns every user in your data set (rather, 148 thousand-ish copies of each user).
Your movie nodes in ASCII -- (n:Movie {movieid:"88cacfca-3def-4b2c-acb2-8e7f4f28be04"}) are 58 bytes minimum. If your users are about the same, let's say each node is 60 bytes, your storage requirement for this resultant set is:
400,000,000 nodes * 60 bytes
= 24,000,000,000 bytes
= 23,437,500 kb
= 22,888 Mb
= 22.35 Gb
So by my conservative estimates, your query requires 22 Gigabytes of storage. This seems quite reasonable that Neo4j would run out of memory.
My guess is that you're trying to find similarities in the patterns of users, but the query you're using is returning all the users in your dataset duplicated a bunch of times. Maybe you want to be asking questions of your data more like:
what users rate movies most like me?
what users rated most of the same movies as I rated
what movies have users that have rated similar movies to me watched that I haven't watched yet?
Cheers,
cm
To minimize the explosion that #cod3monk3y talks about, I'd limit the number of intermediate results.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
or even like this
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)
WITH distinct usersGen2
MATCH (usersGen2)-[:LIKES]->(moviesGen2)
RETURN distinct moviesGen2;
if you want to, you can use "profile start ..." in the neo4j shell to see how many hits / db-rows you create in between, starting with your query and then these two.
Cypher is a pattern matching language, and it is important to remember that the MATCH clause will always find a pattern everywhere it exists in the Graph.
The problem with the MATCH clause you are using is that sometimes Cypher will find different patterns where 'usersGen2' is the same as 'usersLikingMovie1' and where 'movie1' is the same as 'movieGen1' across different patterns. So, in essence, Cypher finds the pattern every single time it exists in the Graph, is holding it in memory for the duration of the query, and then returning all the moviesGen2 nodes, which could actually be the same node n number of times.
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
If you explicitly tell Cypher that the movies and users should be different for each match pattern it should solve the issue. Try this? Additionally, The DISTINCT parameter will make sure you only grab each 'moviesGen2' node once.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
WHERE movie1 <> moviesGen2 AND usersLikingMovie1 <> usersGen2
RETURN DISTINCT moviesGen2;
Additionally, in 2.0, the start clause is not required. So you can actually leave out the START clause all together (However - only if you are NOT using a legacy index and use labels)...
Hope this works... Please correct my answer if there are syntax errors...

Resources