I am working on a project that uses Node.js, Cypher, and Neo4j. The project's front end occasionally needs to QUICKLY pull a random user. I have seen this query on the internet:
MATCH (n:User) WHERE rand() < 0.1 RETURN n LIMIT 21
but I have no idea what this does. It seems pretty fast, but I would like to understand it. A breakdown of what I know:
MATCH | Match some nodes
(n:User) | Let's call this node n, and it has to be of type User
WHERE | Specify conditions for node match
rand() | Return a random number from 0 to 0.9999...
< | Less than
0.1 | ??
RETURN | Give back the matched node(s)
n | Our node(s)
LIMIT 21 | Don't return more than 21 nodes
What do rand() and 0.1 do? Do they somehow limit the potential nodes to return?
If this helps, I have around 10,000 nodes.
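As your question already states, a WHERE clause specifies the conditions for a MATCH to succeed. rand() is evaluated afresh for every candidate node, so WHERE rand() < 0.1 means each User node has a 10% probability of passing the filter. LIMIT 21 then caps the result at 21 of the nodes that passed.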
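To see what that means for the 21 returned nodes, here is a rough Python model of the query (not driver code). It assumes nodes stream through the filter in scan order and that LIMIT stops the scan as soon as 21 rows have passed; the function name and the seeded generator are made up for the illustration.

```python
import random


def sample_with_filter(nodes, threshold=0.1, limit=21, rng=random):
    """Model of: MATCH (n:User) WHERE rand() < threshold RETURN n LIMIT limit."""
    picked = []
    scanned = 0
    for node in nodes:                     # nodes arrive in scan order
        scanned += 1
        if rng.random() < threshold:       # a fresh random number per node
            picked.append(node)
            if len(picked) == limit:       # LIMIT ends the scan early
                break
    return picked, scanned


# Hypothetical data: 10,000 users identified by their scan position
users = list(range(10_000))
picked, scanned = sample_with_filter(users, rng=random.Random(42))
print(len(picked), "nodes returned after scanning", scanned, "nodes")
```

With a 10% pass rate, collecting 21 nodes takes about 210 scanned nodes on average. Under these assumptions the query is quick, but the result is drawn mostly from whatever the scan visits first rather than uniformly from all users.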
My graph is a tree of arbitrary depth. I need to count the number of descendants of each node, including the node itself. The final result should look something like this:
9
|\
4 4
|\ \
2 1 3
|   |\
1   1 1
So for each node, this number is the sum of its children's numbers plus 1.
How can it be done in one query?
I came up with something like this:
MATCH (n)
SET n.count = SIZE((n)<-[:PARENT*0..]-());
But that means a subquery for each node. With over 1,300,000 nodes, it takes ages.
A better way would be to set 1 for each leaf and ascend to the root, calculating each node on the way. Is it possible to do that in one query?
I'd go for
MATCH (start)<-[:PARENT*0..]-(n)
RETURN id(start), count(n) as numberOfChildren
which counts how many nodes are found on the paths below each start node, the start node included. I don't know how it performs on really large graphs, though (my test graph only has a few hundred nodes).
You could already optimize your query by limiting the number of paths you are processing, e.g. like this:
MATCH (n)
WHERE EXISTS((n)<-[:PARENT]-())
MATCH path=(n)<-[:PARENT*0..]-(m)
WHERE NOT EXISTS((m)<-[:PARENT]-())
UNWIND nodes(path) AS node
WITH n, COUNT(DISTINCT node) AS count
SET n.count = count
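The bottom-up idea from the question (give each leaf 1 and ascend towards the root) can be sketched outside Cypher. This is plain Python over a hypothetical child -> parent dictionary that mirrors the example tree; the node names are invented, and it says nothing about how Neo4j would execute either query above.

```python
from collections import defaultdict


def descendant_counts(parent_of):
    """Map each node to the size of its subtree, the node itself included.

    parent_of maps child -> parent; roots do not appear as keys.
    """
    children = defaultdict(list)
    nodes = set()
    for child, parent in parent_of.items():
        children[parent].append(child)
        nodes.update((child, parent))

    roots = [node for node in nodes if node not in parent_of]

    # Iterative depth-first walk: every parent is listed before its children
    order = []
    stack = list(roots)
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    # Walk the list backwards so children are finished before their parent
    counts = {}
    for node in reversed(order):
        counts[node] = 1 + sum(counts[child] for child in children[node])
    return counts


# The example tree from the question, with made-up node names
parent_of = {
    "a": "r", "b": "r",
    "c": "a", "d": "a", "e": "b",
    "f": "c", "g": "e", "h": "e",
}
counts = descendant_counts(parent_of)
print(counts["r"], counts["a"], counts["b"])
```

Each node is visited a constant number of times, which is the property the per-node subquery lacks.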
Background
I want to create a histogram of the relationships starting from a set of nodes.
Input is a set of node ids, for example set = [ id_0, id_1, id_2, id_3, ... id_n ].
The output is the relationship type histogram for each node (e.g. Map<Long, Map<String, Long>>):
id_0:
- ACTED_IN: 14
- DIRECTED: 1
id_1:
- DIRECTED: 12
- WROTE: 5
- ACTED_IN: 2
id_2:
...
The current cypher query I've written is:
MATCH (n)-[r]-()
WHERE id(n) IN [ id_0, id_1, id_2, id_3, ... id_n ] // set
RETURN id(n) as id, type(r) as type, count(r) as count
It returns a count for each [ id, type ] pair, like this:
id | rel type | count
id0 | ACTED_IN | 14
id0 | DIRECTED | 1
id1 | DIRECTED | 12
id1 | WROTE | 5
id1 | ACTED_IN | 2
...
The result is collected using Java and merged into the first structure (e.g. Map<Long, Map<String, Long>>).
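The merge step is small; here is a sketch of it in Python rather than Java, purely for illustration. The numeric ids 0 and 1 stand in for the id0 and id1 placeholders above, and merge_rows is a made-up name.

```python
from collections import defaultdict


def merge_rows(rows):
    """Fold flat (id, type, count) rows into {id: {type: count}}."""
    histogram = defaultdict(dict)
    for node_id, rel_type, count in rows:
        histogram[node_id][rel_type] = count
    return dict(histogram)


# Rows shaped like the query result shown above (ids are placeholders)
rows = [
    (0, "ACTED_IN", 14),
    (0, "DIRECTED", 1),
    (1, "DIRECTED", 12),
    (1, "WROTE", 5),
    (1, "ACTED_IN", 2),
]
result = merge_rows(rows)
print(result)
```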
Problem
Getting the relationship histogram on smaller graphs is fast, but it can be very slow on bigger datasets. For example, when I create the histogram for a set of about 100 ids/nodes where each of those nodes has around 1000 relationships, the Cypher query takes about 5 minutes to execute.
Is there a more efficient way to collect the histogram for a set of nodes?
Could this query be parallelized? (With Java code or using UNION?)
Is something wrong with how I set up my Neo4j database, or should these queries be this slow?
There is no need for parallel queries, just a need to understand Cypher efficiency and how to use statistics.
A bit of background:
Using count will execute an expandAll, whose cost grows with the number of relationships the node has:
PROFILE
MATCH (n) WHERE id(n) = 21
MATCH (n)-[r]-(x)
RETURN n, type(r), count(*)
Using size with a relationship type uses getDegree internally, which is a statistic the node stores locally, and is therefore very efficient:
PROFILE
MATCH (n) WHERE id(n) = 0
RETURN n, size((n)-[:SEARCH_RESULT]-())
Moral of the story: to use size, you need to know the relationship types a labeled node can have. So you need to know the schema of the database (in general you will want that anyway; it makes things easily predictable, and building efficient queries dynamically becomes a joy).
But let's assume you don't know the schema. You can use the APOC Cypher procedures, which allow you to build dynamic queries.
The flow is:
Get all the relationship types from the database (fast)
Get the nodes from the id list (fast)
Build dynamic queries using size (fast)
CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL apoc.cypher.run("RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
RETURN id(n), type, value.count
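The cost difference this answer relies on can be shown with a toy model in Python. The ToyNode class below is entirely hypothetical and is not how Neo4j stores data; it only assumes, as the answer states, that a node keeps a per-type degree it can read without touching its relationships.

```python
from collections import Counter


class ToyNode:
    """Hypothetical node with a relationship list and per-type degree counters."""

    def __init__(self):
        self.relationship_types = []      # one entry per relationship
        self.degree_by_type = Counter()   # kept up to date on every write

    def add_relationship(self, rel_type):
        self.relationship_types.append(rel_type)
        self.degree_by_type[rel_type] += 1

    def histogram_by_expansion(self):
        """Visit every relationship, like the count-based query."""
        histogram = Counter()
        steps = 0
        for rel_type in self.relationship_types:
            histogram[rel_type] += 1
            steps += 1
        return dict(histogram), steps

    def histogram_by_degree(self, known_types):
        """One counter lookup per known type, like the size-based query."""
        histogram = {}
        steps = 0
        for rel_type in known_types:
            steps += 1
            degree = self.degree_by_type[rel_type]
            if degree:
                histogram[rel_type] = degree
        return histogram, steps


node = ToyNode()
for _ in range(14):
    node.add_relationship("ACTED_IN")
node.add_relationship("DIRECTED")

by_expansion = node.histogram_by_expansion()
by_degree = node.histogram_by_degree(["ACTED_IN", "DIRECTED", "WROTE"])
print(by_expansion, by_degree)
```

Both calls produce the same histogram, but the first does one step per relationship and the second one step per relationship type, which is why the size-based query stays cheap on nodes with thousands of relationships.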
I have been struggling to find a solution to my issue with Cypher.
I am trying to sum a given relationship throughout a path.
For example:
1 --> 2 --> 3
        --> 4
I want to calculate, for node 1, the sum of the Amount property for nodes 1, 2, 3 and 4. (In this case 3 and 4 are both targets of node 2, which I can't manage to represent here.)
My understanding is that I need to use collect() and reduce, but I still do not get the right answer. I have the following:
MATCH (n)-[p]->(m)
WITH m,n, collect(m) AS amounts
RETURN n.ID as Source,m.ID as Target,n.Amount,
REDUCE(total = 0, tot IN amounts | total + tot.Amount) AS totalEUR
ORDER BY total DESC
I get a syntax error, but I am pretty sure that even without the syntax error I would only be summing direct relationships...
Would you guys know if I am on the right track?
Cheers
Max
You need a variable-length relationship in the query:
MATCH p = (n)-[*]->(m)
RETURN n.ID as Source, m.ID as Target, n.Amount,
reduce(total = 0, tot IN nodes(p) | total + tot.Amount) AS totalEUR
ORDER BY totalEUR DESC
You can't order by total, which is a variable local to the reduce function.
Note that this will return a row for each path, i.e. 1-->2, 1-->2-->3, 1-->2-->4, 2-->3 and 2-->4 (given that 3 and 4 are both targets of node 2), since you haven't matched on a specific n.
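To make the one-row-per-path behaviour concrete, here is a small Python model of the variable-length match. It assumes the graph from the question (1 points to 2, and 2 points to both 3 and 4), an acyclic graph, and made-up Amount values; all function names are invented for the sketch.

```python
def all_paths(edges):
    """Every directed path with at least one relationship, as node tuples."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    paths = []

    def extend(path):
        for nxt in adjacency.get(path[-1], []):
            new_path = path + (nxt,)
            paths.append(new_path)
            extend(new_path)          # terminates because the graph is acyclic

    for start in sorted({src for src, _ in edges}):
        extend((start,))
    return paths


def subtree_total(source, paths, amounts):
    """Sum Amount over the source and every distinct node reachable from it."""
    reachable = {source}
    for path in paths:
        if path[0] == source:
            reachable.update(path)
    return sum(amounts[node] for node in reachable)


edges = [(1, 2), (2, 3), (2, 4)]
amounts = {1: 10, 2: 20, 3: 30, 4: 40}   # hypothetical Amount values

paths = all_paths(edges)
# One (Source, Target, totalEUR) row per path, like the query above
rows = [(p[0], p[-1], sum(amounts[n] for n in p)) for p in paths]
total_for_1 = subtree_total(1, paths, amounts)
print(rows, total_for_1)
```

None of the per-path rows equals the total over nodes 1, 2, 3 and 4, so getting that single figure for node 1 needs the distinct reachable nodes rather than one path at a time.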
Suppose I have a neo4j database with a single node type and a single relationship type to keep things simple. All relationships have a "cost" property (as in classical graph problems), whose values are non-negative.
Suppose now I want to find all the possible paths between node with ID A and node with ID B, with an upper bound on path length (e.g. 10) such that the total path cost is below or equal to a given constant (e.g. 20).
The Cypher code to accomplish this is the following (and it works):
START a = node(A), b = node(B)
MATCH (a) -[r*0..10]-> (b)
WITH extract(x in r | x.cost) as costs, reduce(acc = 0, x in r | acc + x.cost) as totalcost
WHERE totalcost < 20
RETURN costs, totalcost
The problem with this query is that it doesn't take advantage of the fact that costs are non-negative, so paths that have already exceeded the total cost limit could be pruned. Instead, it lists all possible paths of length 0 to 10 between nodes A and B (which can be ridiculously expensive), calculates the total costs and only then filters out the paths that are above the limit. Pruning paths early would lead to massive performance improvements.
I know this is doable with the traversal framework by using BranchStates and preventing expansion when relevant, but I would like to find a Cypher solution (mainly due to the reasons exposed here).
I am currently using version 2.2.2, if that matters.
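Would summing the relationship costs in the WHERE clause, before the extract, be sufficient? Since sum() is an aggregating function and cannot be used in WHERE, the sum is written with reduce here:
START a = node(A), b = node(B)
MATCH (a)-[r*0..10]->(b)
WHERE reduce(acc = 0, x in r | acc + x.cost) < 20
WITH extract(x in r | x.cost) as costs, reduce(acc = 0, x in r | acc + x.cost) as totalcost
RETURN costs, totalcost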
By the way, wanting to prune means you want an imperative approach!
Also, please help Cypher a bit: use labels.
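For reference, the pruning the question asks for looks like this when written imperatively. This is a Python sketch over a made-up edge list, not the Neo4j traversal framework; it assumes non-negative costs, treats the limit as inclusive as in the question's wording, and does not reuse a relationship within one path.

```python
def bounded_paths(edges, start, goal, max_len=10, max_cost=20):
    """Paths from start to goal with at most max_len relationships
    and a total cost of at most max_cost. edges holds (src, dst, cost)."""
    adjacency = {}
    for edge_id, (src, dst, cost) in enumerate(edges):
        adjacency.setdefault(src, []).append((edge_id, dst, cost))

    results = []

    def expand(node, used, costs, total):
        if node == goal:
            results.append((list(costs), total))
        if len(costs) == max_len:
            return
        for edge_id, nxt, cost in adjacency.get(node, []):
            if edge_id in used:
                continue                  # a relationship appears once per path
            if total + cost > max_cost:
                continue                  # prune: non-negative costs cannot recover
            used.add(edge_id)
            costs.append(cost)
            expand(nxt, used, costs, total + cost)
            costs.pop()
            used.remove(edge_id)

    expand(start, set(), [], 0)
    return results


# Hypothetical graph: only A -> C -> B stays within the budget of 20
edges = [
    ("A", "B", 25),
    ("A", "C", 5),
    ("C", "B", 10),
    ("C", "D", 30),
    ("D", "B", 1),
]
found = bounded_paths(edges, "A", "B")
print(found)
```

The branch through D is abandoned at C -> D, before D's outgoing relationships are ever looked at, which is the saving a filter applied after full path enumeration cannot give.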
Looking for a little assistance on a Cypher query. Given a set of customers peer who own book p, I am able to retrieve a set of customers target who own at least one book also owned by peer but who don't own p. This is accomplished using the following query:
match
(p:Book {isbn:"123456"})<-[:owns]-(peer:Customer)
-[:owns]->(other:Book)<-[o:owns]-(target:Customer)
WHERE NOT( (target)-[:owns]->(p))
return target.name
limit 10;
My next step is to determine how many other books each member of the target set owns, and to order those members accordingly. I've attempted several variations based on the Neo4j documentation and SO answers, but am having no luck. For instance, I tried using WITH:
match
(p:Book {isbn:"123456"})<-[:owns]-(peer:Customer)
-[:owns]->(other:Book)<-[o:owns]-(target:Customer)
WHERE NOT( (target)-[:owns]->(p))
WITH target, count(o) as co
WHERE co > 1
return target.name
limit 10;
I also tried what seemed, to my novice eye, the most reasonable query:
match
(p:Book {isbn:"123456"})<-[:owns]-(peer:Customer)
-[:owns]->(other:Book)<-[o:owns]-(target:Customer)
WHERE NOT( (target)-[:owns]->(p))
return target.name, count(o)
limit 10;
In both of these cases, the query just runs without end (upwards of 10 minutes before I stop execution). Any insight into what I'm doing wrong?
EDIT
As it turns out this latter query does execute but takes 15 minutes to complete and is reporting incorrect numbers, as evidenced here:
+-------------------------------+
| target.name | count(o) |
+-------------------------------+
| "John Smith" | 12840 |
| "Mary Moore" | 11501 |
+-------------------------------+
I'm looking for the number of books each customer specifically owns, not sure where these 12840 and 11501 numbers are coming from really. Any thoughts?
How about this one:
MATCH (p:Book {isbn:"123456"})<-[:owns]-(peer:Customer)
WITH distinct peer, p
MATCH (peer)-[:owns]->(other:Book)
WITH distinct other, p
MATCH (other)<-[o:owns]-(target:Customer)
WHERE NOT((target)-[:owns]->(p))
RETURN target.name, count(o)
LIMIT 10;
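One possible explanation for counts like 12840, modeled in Python on a tiny made-up data set: the single long pattern produces one row per (peer, other, target) combination, so count(o) counts rows rather than books. The customer and book names are invented, and the model assumes Cypher's rule that a relationship is not reused within one MATCH pattern.

```python
from collections import Counter

# Hypothetical ownership data: customer -> set of books; "P" plays the role of p
owns = {
    "peer1": {"P", "X", "Y"},
    "peer2": {"P", "X", "Y"},
    "T": {"X", "Y", "Z"},
}


def row_counts(owns, p):
    """count(o) per target for the single-pattern query: one row per combination."""
    rows = Counter()
    for peer, peer_books in owns.items():
        if p not in peer_books:
            continue
        for other in peer_books - {p}:
            for target, target_books in owns.items():
                if target != peer and other in target_books and p not in target_books:
                    rows[target] += 1
    return dict(rows)


def distinct_counts(owns, p):
    """count(o) per target once peers and other books are de-duplicated first."""
    others = set()
    for books in owns.values():
        if p in books:
            others |= books - {p}
    return {
        target: len(books & others)
        for target, books in owns.items()
        if p not in books and books & others
    }


inflated = row_counts(owns, "P")
shared = distinct_counts(owns, "P")
print(inflated, shared, len(owns["T"]))
```

In this toy data T shares two books with the peers, but the single pattern reports 4 because each shared book is reached once per peer. With the DISTINCT steps the figure drops to 2, which is still the number of shared books and not the 3 books T owns in total.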