I need to aggregate data during the query and then order by this data.
According to cypher documentation:
If you want to use aggregations to sort your result set, the
aggregation must be included in the RETURN to be used in your ORDER
BY.
I have the following cypher query:
START profile=node(31) MATCH (profile)-[r:ROLE]->(story)
WHERE r.role="LEADER" and story.status="PRIVATE"
WITH story MATCH (story)<-[r?:RATED]-()
RETURN distinct story ,sum(r.rate) as rate ORDER BY rate DESCENDING
The above query works fine; the problem is that I must include sum(r.rate) in my result set.
I am using Cypherdsl via a repository (my repository extends CypherDslRepository), and when querying, the response should be a story list/page.
Can I use order by aggregation function without including it in the result set?
Any workaround for that?
Thanks.
You can do it with an intermediate `WITH`:
START profile=node(31) MATCH (profile)-[r:ROLE]->(story)
WHERE r.role="LEADER" and story.status="PRIVATE"
WITH story
MATCH (story)<-[r?:RATED]-()
WITH story ,sum(r.rate) as rate
ORDER BY rate DESCENDING
RETURN story
And leave off DISTINCT if you already have aggregation.
Optional relationships are also slow, so if you run into a performance issue, use a path expression instead and get the relationship from there:
START profile=node(31) MATCH (profile)-[r:ROLE]->(story)
WHERE r.role="LEADER" and story.status="PRIVATE"
WITH story, extract(p in (story)<-[r?:RATED]-() : head(rels(p))) as rated
WITH story , reduce(sum = 0, r in rated : sum + r.rate) as rate
ORDER BY rate DESCENDING
RETURN story
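The idea behind the intermediate WITH can be modelled outside the database. The following is only a minimal Python sketch of the semantics, not of how Neo4j executes the query; the function name and the sample data are hypothetical. It aggregates per story, sorts by the aggregate, and then returns the stories alone.

```python
from collections import defaultdict


def stories_by_total_rating(stories, ratings):
    """Return story ids ordered by summed rating (highest first), without the sums.

    `stories` is a list of story ids and `ratings` a list of (story_id, rate)
    pairs; both are hypothetical stand-ins for the graph data.
    """
    totals = defaultdict(int)
    for story_id, rate in ratings:
        totals[story_id] += rate  # WITH story, sum(r.rate) AS rate
    # Unrated stories get a total of 0, like sum() over an optional match.
    ordered = sorted(stories, key=lambda s: totals[s], reverse=True)  # ORDER BY rate DESC
    return ordered  # RETURN story (the aggregate is dropped here)


if __name__ == "__main__":
    print(stories_by_total_rating(["a", "b", "c"], [("a", 2), ("b", 5), ("a", 1)]))
```

The aggregate exists only in the intermediate step, which is the role the second WITH plays in the Cypher query.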
match(m:master_node:Application)-[r]-(k:master_node:Server)-[r1]-(n:master_node)
where (m.name contains '' and (n:master_node:DeploymentUnit or n:master_node:Schema))
return distinct m.name,n.name
Hi, I am trying to get the total number of records for the above query. How can I change the query to use the count function so that I get the record count directly?
Thanks in advance
The following query uses the aggregating function COUNT. Distinct pairs of m.name, n.name values are used as the "grouping keys".
MATCH (m:master_node:Application)--(:master_node:Server)--(n:master_node)
WHERE EXISTS(m.name) AND (n:DeploymentUnit OR n:Schema)
RETURN m.name, n.name, COUNT(*) AS cnt
I assume that m.name contains '' in your query was an attempt to test for the existence of m.name. This query uses the EXISTS() function to test that more efficiently.
[UPDATE]
To determine the number of distinct n and m pairs in the DB (instead of the number of times each pair appears in the DB):
MATCH (m:master_node:Application)--(:master_node:Server)--(n:master_node)
WHERE EXISTS(m.name) AND (n:DeploymentUnit OR n:Schema)
WITH DISTINCT m.name AS n1, n.name AS n2
RETURN COUNT(*) AS cnt
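The difference between the two queries above (a count per pair versus the number of distinct pairs) can be illustrated with a small Python sketch. The rows and names below are made up for illustration; each tuple stands for one matched path's (m.name, n.name) values.

```python
from collections import Counter


def count_per_pair(rows):
    """Model of RETURN m.name, n.name, COUNT(*): one count per distinct pair."""
    return Counter(rows)


def count_distinct_pairs(rows):
    """Model of WITH DISTINCT m.name, n.name RETURN COUNT(*): a single number."""
    return len(set(rows))


# Hypothetical matched rows; the first pair is reached via two different paths.
rows = [("app1", "unitA"), ("app1", "unitA"), ("app1", "schemaB"), ("app2", "unitA")]

if __name__ == "__main__":
    print(count_per_pair(rows))
    print(count_distinct_pairs(rows))
```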
Some things to consider for speeding up the query even further:
Remove unnecessary label tests from the MATCH pattern. For example, can we omit the master_node label test from any nodes? In fact, can we omit all label testing for any nodes without affecting the validity of the result? (You will likely need a label on at least one node, though, to avoid scanning all nodes when kicking off the query.)
Can you add a direction to each relationship (to avoid having to traverse relationships in both directions)?
Specify the relationship types in the MATCH pattern. This will filter out unwanted paths earlier. Once you do so, you may also be able to remove some node labels from the pattern as long as you can still get the same result.
Use the PROFILE clause to evaluate the number of DB hits needed by different Cypher queries.
You can find examples of how to use count in the Neo4j docs here
In your case, the first example, where count(*) is used to return a count of each returned item, should work.
I find it hard to explain, so consider the following picture
I'm trying to select all products that fulfill the warehouse requirements
In this example I need to select all products that have a maximum size of 5 AND maximum weight of 10.
To simplify, I only have MAX (no MIN or EQ) constraints, so the operator can be hardcoded.
I've tried to group the requirement subgraph using COLLECT and using the ALL operator, but failed.
Query to create the graph
CREATE
// NODES
(warehouse:WAREHOUSE{name:'My Warehouse'}),
(smallProduct:PRODUCT{name:'Small Product'}),
(largeProduct:PRODUCT{name:'Large Product'}),
// RELATIONSHIPS
(size:CONSTRAINT{name:'Size'}),
(weight:CONSTRAINT{name:'Weight'}),
(warehouse)-[:LIMIT{value:5}]->(size),
(warehouse)-[:LIMIT{value:5}]->(weight),
(smallProduct)-[:AMOUNT{value:3}]->(size),
(smallProduct)-[:AMOUNT{value:2}]->(weight),
(largeProduct)-[:AMOUNT{value:10}]->(size),
(largeProduct)-[:AMOUNT{value:4}]->(weight)
UPDATE
The following query apparently solves the problem:
MATCH (warehouse:WAREHOUSE)
MATCH rel = ((warehouse)-[limit:LIMIT]->(constraint:CONSTRAINT)<-[amount:AMOUNT]-(product:PRODUCT))
WITH warehouse, product, collect(relationships(rel)) as paths
WHERE all( p in paths WHERE p[0].value > p[1].value )
return product
I am wondering if there is a better solution.
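For reference, the check that the query performs can be written as a small Python model. This is only a sketch of the intended logic, using plain dictionaries (a hypothetical representation) filled with the values from the CREATE statement; it keeps the strict comparison used in the query.

```python
def products_within_limits(limits, products):
    """Return names of products whose amounts are below every shared limit.

    `limits` maps constraint name -> warehouse limit; `products` maps
    product name -> {constraint name -> amount}.
    """
    result = []
    for name, amounts in products.items():
        # Only constraints present on both sides form a path in the MATCH.
        shared = [c for c in amounts if c in limits]
        # A product with no shared constraint produces no rows in the query,
        # so it is excluded here as well.
        if shared and all(limits[c] > amounts[c] for c in shared):
            result.append(name)
    return result


# Values taken from the CREATE statement in the question.
limits = {"Size": 5, "Weight": 5}
products = {
    "Small Product": {"Size": 3, "Weight": 2},
    "Large Product": {"Size": 10, "Weight": 4},
}

if __name__ == "__main__":
    print(products_within_limits(limits, products))
```

Note that, like the query, this only compares constraints the product actually has an amount for; a warehouse limit without a matching amount on the product is not checked.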
I have some sample tweets stored in Neo4j. The query below finds the top hashtags from a specific country. It takes a lot of time because the time filter for the status nodes is in the WHERE clause, which slows the response. Is it possible to move this filter to the MATCH clause so that status nodes are filtered before relationships are found?
match (c:country{countryCode:"PK"})-[*0..4]->(s:status)-[*0..1]->(h:hashtag) where (s.createdAt >= datetime('2017-06-01T00:00:00') AND s.createdAt <= datetime('2017-06-01T23:59:59')) return h.name,count(h.name) as hCount order by hCount desc limit 100
thanks
As mentioned in my comment, whether a predicate for a property is in the MATCH clause or the WHERE clause shouldn't matter, as this is just syntactical sugar and is interpreted the same way by the query planner.
You can use PROFILE or EXPLAIN to see the query plan to see what it's doing. PROFILE will give you more information but will have to actually execute the query. You can attempt to use planner hints to force the planner to plan the match differently which may yield a better approach.
You will want to ensure you have an index on :status(createdAt).
You can also try altering your match a little, and moving the portion connecting to the country in question into your WHERE clause instead. Also it's a good idea to get the count based upon the hashtag node itself (assuming there's only one :hashtag node for a given name) so you can order and limit before you do property access:
MATCH (s:status)-[*0..1]->(h:hashtag)
WHERE (s.createdAt >= datetime('2017-06-01T00:00:00') AND s.createdAt <= datetime('2017-06-01T23:59:59'))
AND (:country{countryCode:"PK"})-[*0..4]->(s)
WITH h, count(h) as hCount
ORDER BY hCount DESC
LIMIT 100
RETURN h.name, hCount
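The point about counting on the hashtag node itself, and reading the name property only after ordering and limiting, can be sketched in Python. This is a rough model under the same assumption as above (one node per hashtag name); the ids, names and function are hypothetical.

```python
from collections import Counter


def top_hashtags(status_tags, names, limit=100):
    """Count by hashtag id, keep the top `limit`, then look up names.

    `status_tags` is a list of (status_id, hashtag_id) pairs that already
    passed the time and country filters; `names` maps hashtag_id -> name.
    """
    counts = Counter(tag_id for _, tag_id in status_tags)  # WITH h, count(h) AS hCount
    top = counts.most_common(limit)                        # ORDER BY hCount DESC LIMIT 100
    # The property lookup happens for at most `limit` hashtags.
    return [(names[tag_id], n) for tag_id, n in top]       # RETURN h.name, hCount


if __name__ == "__main__":
    pairs = [(1, "h1"), (2, "h1"), (3, "h2")]
    print(top_hashtags(pairs, {"h1": "neo4j", "h2": "cypher"}, limit=1))
```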
Please check my Cypher below. The query returns results when there are few records, but as the number of records increases it takes a long time, about 1601152 ms.
I found a suggestion to add USING INDEX, so I applied USING INDEX in the query:
PROFILE MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person)-[:WATCHED]->(ma:Movie)-[:HAS_TAG]->(t:Tag)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
USING INDEX a:App(app_id) WHERE p.person_id= '1'
AND NOT (p:Person)-[:WATCHED]-(mb)
RETURN DISTINCT(mb.movie_id) , mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(DISTINCT(t.tag_id)) as Tag, count(DISTINCT(t.tag_id)) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50
Can you help me work out what I can do?
I am trying to find 100 movies to recommend on the basis of tags, that is, 100 movies which I have not watched and whose tags match the tags of movies I have watched.
The following query may work better for you [assuming you have indexes on both :App(app_id) and :Person(person_id)]. By the way, I presumed that in your query the identifier ma should have been m (or vice versa).
MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person {person_id: '1'})-[:WATCHED]->(m)
WITH a, p, COLLECT(m) AS movies
UNWIND movies AS movie
MATCH (movie)-[:HAS_TAG]->(t)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
WHERE NOT mb IN movies
WITH DISTINCT mb, t
RETURN mb.movie_id, mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(t.tag_id) as Tag, COUNT(t.tag_id) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50;
If you PROFILE this query, you should see that it performs NodeIndexSeek operations (instead of the much slower NodeByLabelScan) to quickly execute the first MATCH. The query also collects all the movies watched by the specified person and uses that collection later to speed up the WHERE clause (which no longer needs to hit the DB). In addition, the query removed some labels from some of the node patterns (where doing so seemed likely to be unambiguous) to speed up processing further.
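The "collect the watched movies first, then filter against that collection" idea can be modelled in a few lines of Python. This is only a sketch of the logic with hypothetical in-memory data; it ignores the App scoping and is not a replacement for the Cypher query.

```python
def recommend(watched, movie_tags, limit=50):
    """Rank unwatched movies by the number of distinct tags shared with watched ones.

    `watched` is an iterable of movie ids; `movie_tags` maps movie id -> set of tag ids.
    """
    watched = set(watched)                      # COLLECT(m) AS movies
    watched_tags = set()
    for movie in watched:
        watched_tags |= movie_tags.get(movie, set())

    shared_by_movie = {}
    for movie, tags in movie_tags.items():
        if movie in watched:                    # WHERE NOT mb IN movies (in-memory check)
            continue
        shared = tags & watched_tags            # WITH DISTINCT mb, t
        if shared:
            shared_by_movie[movie] = shared

    ranked = sorted(shared_by_movie.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(movie, sorted(tags)) for movie, tags in ranked[:limit]]


if __name__ == "__main__":
    tags = {"m1": {"t1", "t2"}, "m2": {"t1", "t2", "t3"}, "m3": {"t2"}, "m4": {"t9"}}
    print(recommend(["m1"], tags))
```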
Is there a default way to match only the first n relationships, other than filtering with LIMIT n later?
I have this query:
START n=node({id})
MATCH n--u--n2
RETURN u, count(*) as cnt order by cnt desc limit 10;
But assuming the number of n--u relationships is very high, I want to relax this query and take, for example, the first 100 random relationships and then continue with u--n2...
This is for a collaborative filtering task, and assuming the users are more or less similar, I don't want to match all users u but a random subset. This approach should perform better: I currently get ~500ms query time but would like to get it under 50ms.
I know I could break the above query into 2 separate ones, but the first query would still go through all users and only later limit the output. I want to limit the maximum number of relationships during the match phase.
You can pipe the current results of your query using WITH, then LIMIT those initial results, and then continue on in the same query:
START n=node({id})
MATCH n--u
WITH u
LIMIT 10
MATCH u--n2
RETURN u, count(*) as cnt
ORDER BY cnt desc
LIMIT 10;
The query above will give you the first 10 us found, and then continue to find the first ten matching n2s.
Optionally, you can leave off the second LIMIT and you will get all matching n2s for the first ten us (meaning you could have more than ten rows returned if they matched the first 10 us).
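The effect of limiting in the middle of the pipeline can be sketched in Python with an adjacency-list graph. The graph representation and function name are hypothetical; the point is only that the second expansion runs over the truncated set of u nodes rather than over all of them.

```python
from collections import Counter
from itertools import islice


def top_neighbours(graph, start, sample_size=10, limit=10):
    """Expand only the first `sample_size` neighbours of `start`, then count.

    `graph` maps each node to a list of its neighbours (undirected).
    """
    sampled = list(islice(graph[start], sample_size))  # MATCH n--u WITH u LIMIT 10
    counts = Counter()
    for u in sampled:
        counts[u] += len(graph[u])                     # MATCH u--n2 ... count(*) per u
    return counts.most_common(limit)                   # ORDER BY cnt DESC LIMIT 10


if __name__ == "__main__":
    g = {
        "n": ["u1", "u2", "u3"],
        "u1": ["n", "x", "y"],
        "u2": ["n"],
        "u3": ["n", "x"],
        "x": ["u1", "u3"],
        "y": ["u1"],
    }
    print(top_neighbours(g, "n", sample_size=2))
```

Here "first" simply means the first neighbours in storage order; it is not a random sample.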
This is not a direct solution to your question, but since I was running into a similar problem, my work-around might be interesting for you.
What I need to do is: get relationships by index (might yield many thousands) and get the start node of these. Since the start node is always the same with that index-query, I only need the very first relationship's startnode.
Since I wasn't able to achieve that with cypher (the proposed query by ean5533 does not perform any better), I am using a simple unmanaged extension (nice template).
@GET
@Path("/address/{address}")
public Response getUniqueIDofSenderAddress(@PathParam("address") String addr, @Context GraphDatabaseService graphDB) throws IOException
{
    try {
        RelationshipIndex index = graphDB.index().forRelationships("transactions");
        IndexHits<Relationship> rels = index.get("sender_address", addr);
        int unique_id = -1;
        // Only the first hit is needed, since every hit shares the same start node.
        for (Relationship rel : rels) {
            Node sender = rel.getStartNode();
            unique_id = (Integer) sender.getProperty("unique_id");
            rels.close();
            break;
        }
        return Response.ok().entity("Unique ID: " + unique_id).build();
    } catch (Exception e) {
        return Response.serverError().entity("Could not get unique ID.").build();
    }
}
For this case here, the speed up is quite nice.
I don't know your exact use case, but since Neo4j even supports HTTP streaming afaik, you should be able to convert your query to an unmanaged extension and still get the full performance.
E.g., "java-querying" all your qualifying nodes and emit the partial result to the HTTP stream.