Neo4j: how to match only the first n relationships

Is there a default way to match only the first n relationships, other than filtering with LIMIT n afterwards?
I have this query:
START n=node({id})
MATCH n--u--n2
RETURN u, count(*) AS cnt ORDER BY cnt DESC LIMIT 10;
But assuming the number of n--u relationships is very high, I want to relax this query and take, for example, the first 100 random relationships and then continue with u--n2...
This is for a collaborative filtering task, and assuming the users are more or less similar, I don't want to match all users u but only a random subset. This approach should be faster; right now I get ~500ms query time but would like to drop it under 50ms.
I know I could break the above query into two separate ones, but the first query would still go through all users and only limit the output afterwards. I want to limit the maximum number of relationships during the match phase.

You can pipe the current results of your query using WITH, then LIMIT those initial results, and then continue on in the same query:
START n=node({id})
MATCH n--u
WITH u
LIMIT 10
MATCH u--n2
RETURN u, count(*) as cnt
ORDER BY cnt desc
LIMIT 10;
The query above will give you the first 10 u nodes found, and then continue matching, returning at most ten (u, cnt) rows.
Optionally, you can leave off the second LIMIT and you will get all matching n2 nodes for the first ten u nodes (meaning you could have more than ten rows returned if they matched the first 10 u nodes).
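For comparison, the variant without the second LIMIT simply omits the final line; the cap is then only on how many u nodes are expanded, not on the number of result rows:

```cypher
START n=node({id})
MATCH n--u
WITH u
LIMIT 10
MATCH u--n2
RETURN u, count(*) AS cnt
ORDER BY cnt DESC;
```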

This is not a direct solution to your question, but since I was running into a similar problem, my work-around might be interesting for you.
What I need to do is: get relationships by index (which might yield many thousands) and get their start node. Since the start node is always the same for that index query, I only need the very first relationship's start node.
Since I wasn't able to achieve that with Cypher (the query proposed by ean5533 does not perform any better), I am using a simple unmanaged extension (nice template).
@GET
@Path("/address/{address}")
public Response getUniqueIDofSenderAddress(@PathParam("address") String addr, @Context GraphDatabaseService graphDB) throws IOException
{
    try {
        RelationshipIndex index = graphDB.index().forRelationships("transactions");
        IndexHits<Relationship> rels = index.get("sender_address", addr);
        int unique_id = -1;
        // The start node is the same for every hit, so the first relationship suffices.
        for (Relationship rel : rels) {
            Node sender = rel.getStartNode();
            unique_id = (Integer) sender.getProperty("unique_id");
            rels.close();
            break;
        }
        return Response.ok().entity("Unique ID: " + unique_id).build();
    } catch (Exception e) {
        return Response.serverError().entity("Could not get unique ID.").build();
    }
}
For this case, the speed-up is quite nice.
I don't know your exact use case, but since Neo4j supports HTTP streaming afaik, you should be able to convert your query to an unmanaged extension and still get the full performance.
E.g., query all your qualifying nodes in Java and emit the partial results to the HTTP stream.
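A minimal sketch of that idea, assuming a Neo4j version where GraphDatabaseService.execute() is available; the endpoint path and the query itself are illustrative, not taken from the original post:

```java
@GET
@Path("/similar/{id}")
public Response streamSimilar(@PathParam("id") final long id,
                              @Context final GraphDatabaseService graphDB) {
    // Stream each result row to the client as soon as it is computed,
    // instead of materializing the whole result set first.
    StreamingOutput stream = new StreamingOutput() {
        @Override
        public void write(OutputStream output) throws IOException {
            PrintWriter writer = new PrintWriter(output);
            try (Transaction tx = graphDB.beginTx()) {
                Result result = graphDB.execute(
                    "MATCH (n)--(u)--(n2) WHERE id(n) = {id} " +
                    "RETURN u, count(*) AS cnt ORDER BY cnt DESC LIMIT 10",
                    Collections.<String, Object>singletonMap("id", id));
                while (result.hasNext()) {
                    Map<String, Object> row = result.next();
                    writer.println(row.get("u") + "\t" + row.get("cnt"));
                    writer.flush(); // push the row out immediately
                }
                result.close();
                tx.success();
            }
            writer.close();
        }
    };
    return Response.ok(stream).build();
}
```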

Related

I want to rank the nodes by degree - why is this Neo4j Cypher request so slow?

I want to first get all the nodes of a certain type connected to a context and then simply rank them by their degree, but only counting the :TO type of connection to other nodes that belong to the same context. I tried several ways, including the ones below, but they are too slow (tens of seconds). Is there any way to make it faster?
MATCH (ctx:Context{uid:'60156a60-d3e1-11ea-9477-f71401ca7fdb'})<-[:AT]-(c1:Concept)
WITH c1 MATCH (c1)-[r:TO]-(c2:Concept)
WHERE r.context = '60156a60-d3e1-11ea-9477-f71401ca7fdb'
RETURN c2, count(r) as degree ORDER BY degree DESC LIMIT 10;
MATCH (ctx:Context{uid:'60156a60-d3e1-11ea-9477-f71401ca7fdb'})<-[:AT]-(c1:Concept)-[:TO]-(c2:Concept)
RETURN c1, count(c2) as degree
ORDER BY degree DESC LIMIT 10;
One way to examine degree is using the size() function; have you tried something like this?
size((c1)-[:TO]-(:Concept))
In my graph, size() appears to be more efficient, though it might be my Cypher rearrangement as well.
Example (in my graph): this statement is 81 db hits
PROFILE MATCH (g:Gene {name:'ACE2'})-[r:EXPRESSED_IN]-(a)
return count(r)
And this is 4 db hits
PROFILE MATCH (g:Gene {name:'ACE2'})
return size((g)-[:EXPRESSED_IN]-())
I'm not sure this next suggestion is faster or more efficient, but if you always calculate degree on a single relationship type or a subset of types, you might look into storing the degree values on the nodes to see whether that is a (faster) option.
I do this on my entire graph right after a bulk load
CALL apoc.periodic.iterate(
"MATCH (n) return n",
"set n.degree = size((n)--())",
{batchSize:50000, batchMode: "BATCH", parallel:true});
but for a different reason: I want to see the degree value in the Neo4j browser, for example. Note: I rebuild my graphs daily from the ground up, but each graph is then static until the next rebuild.
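Once the degree property has been written, the original top-10-by-degree lookup reduces to a simple property sort. A sketch, assuming the n.degree property set by the batch job above:

```cypher
MATCH (c:Concept)
RETURN c, c.degree AS degree
ORDER BY degree DESC LIMIT 10
```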

Independent matches in cypher query

I have a Neo4j database with User, Content, and Topic nodes. I want to calculate the proportion of content consumed by a given user for a given topic.
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
MATCH (c1:Content)<-[:CONTAINS]-(z)
RETURN toFloat(COUNT(DISTINCT(c))) / toFloat(COUNT(DISTINCT(c1)))
Two things strike me as really ugly here:
Firstly, COUNT(DISTINCT()) feels like a hack to get around the fact that the two MATCH clauses cross-join.
Secondly, the float division is ugly.
The second is something I can live with, but the first seems inefficient; is there a better way to express this idea?
The count of content should return the number of pieces of content a user consumed unless of course they consumed the same content more than once.
Instead of matching all of the content from the topic, if your model permits, you could just get the size of the outbound CONTAINS relationships.
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
RETURN toFloat(count(distinct c))/ size((t)-[:CONTAINS]->()) as proportion
Your original query returns a cartesian product of the number of user-content-topic matches times the number of topic-content matches. As an alternative to the above, you could rewrite your original query something like this: it gets the content consumed by the user for the topic, does the aggregation, and then passes the topic and the resulting count to the next clause in the query. This works, however, using size((t)-[:CONTAINS]->()) will be more efficient.
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
WITH t, count(distinct c ) as distinct_content
MATCH (t)-[:CONTAINS]->(c1:Content)
RETURN toFloat(distinct_content) / count(c1)
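As a side note on the float-division complaint above: in Cypher, arithmetic with one float operand yields a float, so multiplying by 1.0 avoids the explicit toFloat() casts. A stylistic sketch of the same query:

```cypher
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
RETURN 1.0 * count(DISTINCT c) / size((t)-[:CONTAINS]->()) AS proportion
```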

Neo4J order by count relationships extremely slow

I'm trying to model a large knowledge graph. (using v3.1.1).
My actual graph contains only two types of Nodes (Topic, Properties) and a single type of Relationships (HAS_PROPERTIES).
The count of nodes is about 85M (47M :Topic, the rest of nodes are :Properties).
To get the most connected :Topic node, I'm using the following query:
MATCH (n:Topic)-[r]-()
RETURN n, count(DISTINCT r) AS num
ORDER BY num
This query, or almost any query that uses count(relationships) and ORDER BY count(relationships) without filtering the results, is extremely slow: these queries take more than 10 minutes and still produce no response.
Am I missing indexes, or is there better syntax?
Is there any chance I can execute this query in a reasonable time?
Use this:
MATCH (n:Topic)
RETURN n, size( (n)--() ) AS num
ORDER BY num DESC
LIMIT 100
This reads the degree from each node directly instead of expanding and counting all of its relationships.

Neo4j / Cypher query syntax feedback

I'm developing a kind of reddit service to learn Neo4j.
Everything works fine, I just want to get some feedback on the Cypher query to get the most recent news stories, the author and number of comments, likes and dislikes.
I'm using Neo4j 2.0.
MATCH comments = (n:news)-[:COMMENT]-(o)
MATCH likes = (n:news)-[:LIKES]-(p)
MATCH dislikes = (n:news)-[:DISLIKES]-(q)
MATCH (n:news)-[:POSTED_BY]-(r)
WITH n, r, count(comments) AS num_comments, count(likes) AS num_likes, count(dislikes) AS num_dislikes
ORDER BY n.post_date
LIMIT 20
RETURN *
o, p, q, r are all nodes with the label user. Should the label be added to the query to speed it up?
Is there anything else you see that I could optimize?
I think you're going to want to get rid of the multiple MATCH clauses: Cypher filters on each one in turn, each filtering through the results of the previous, rather than fetching all the information at once.
I would also avoid binding paths like comments, and instead count the nodes you are matching. When you do MATCH xyz = (a)-[:COMMENT]-(b), xyz is a path, which contains the source node, the relationship, and the destination node.
MATCH (news:news)-[:COMMENT]-(comment),
      (news:news)-[:LIKES]-(like),
      (news:news)-[:DISLIKES]-(dislike),
      (news:news)-[:POSTED_BY]-(posted_by)
WHERE news.post_date > 0
WITH news, posted_by, count(comment) AS num_comments, count(like) AS num_likes, count(dislike) AS num_dislikes
ORDER BY news.post_date
LIMIT 20
RETURN *
I would do something like this.
MATCH (n:news)-[:POSTED_BY]->(r)
WHERE n.post_date > {recent_start_time}
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes
ORDER BY n.post_date DESC
LIMIT 20
To speed it up and avoid having Neo4j search over all your posts, I would probably index the post_date field (assuming it doesn't contain time information), and then send this query in for today, yesterday, etc. until you have your 20 posts.
MATCH (n:news {post_date: {day}})-[:POSTED_BY]->(r)
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes
ORDER BY n.post_date DESC
LIMIT 20

Cypher order by aggregation

I need to aggregate data during the query and then order by this data.
According to cypher documentation:
If you want to use aggregations to sort your result set, the
aggregation must be included in the RETURN to be used in your ORDER
BY.
I have the following cypher query:
START profile=node(31) MATCH (profile)-[r:ROLE]->(story)
WHERE r.role="LEADER" and story.status="PRIVATE"
WITH story MATCH (story)<-[r?:RATED]-()
RETURN distinct story ,sum(r.rate) as rate ORDER BY rate DESCENDING
The above query works fine; the problem is that I must include sum(r.rate) in my result set.
I am using CypherDSL via a repository (my repository extends CypherDslRepository); when querying, the response should be a story list/page...
Can I use order by aggregation function without including it in the result set?
Any workaround for that?
Thanks.
You can do it with an intermediate WITH:
START profile=node(31) MATCH (profile)-[r:ROLE]->(story)
WHERE r.role="LEADER" and story.status="PRIVATE"
WITH story
MATCH (story)<-[r?:RATED]-()
WITH story ,sum(r.rate) as rate
ORDER BY rate DESCENDING
RETURN story
And leave off DISTINCT, since you already have aggregation.
Also, optional relationships are slow, so if you run into a perf issue, rather use a path expression and get the relationship from there.
START profile=node(31) MATCH (profile)-[r:ROLE]->(story)
WHERE r.role="LEADER" and story.status="PRIVATE"
WITH story, extract(p in (story)<-[r?:RATED]-() : head(rels(p))) as rated
WITH story, reduce(sum = 0, r in rated : sum + r.rate) as rate
ORDER BY rate DESCENDING
RETURN story