Efficiently Exporting Relationships From Neo4j

I have a relatively small but growing database (2M nodes, 5M relationships). Relationships often change. I periodically need to export the list of relationships for some other computations.
At present, I use a paginated query, but it gets slower as the value of SKIP increases:
MATCH (a)-[r]->(b) RETURN ID(a) AS id1, ID(b) AS id2, TYPE(r) AS r_type
SKIP %d LIMIT %d
I am using py2neo. The relevant bit of code:
while count <= num_records:
    for record in graph.cypher.stream(cq % (skip, limit)):
        id1 = record["id1"]
        id2 = record["id2"]
        r_type = record["r_type"]
        count += 1
    skip += limit  # advance to the next page
Is there a better / more efficient way to do this?
Thanks in advance.

You don't have to use SKIP / LIMIT in the first place.
Neo4j can easily output gigabytes of data.
See this blog post for another way of doing that: http://neo4j.com/blog/export-csv-from-neo4j-curl-cypher-jq/
You can also use Save as CSV in Neo4j Browser after you run a query.
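For example, with py2neo you can drop the pagination entirely and stream the whole result set in one pass. A minimal sketch, reusing the graph object and record fields from the question; write_edge() is a hypothetical placeholder for whatever you do with each relationship:
# Stream every relationship with a single query -- no SKIP/LIMIT bookkeeping.
cq = "MATCH (a)-[r]->(b) RETURN ID(a) AS id1, ID(b) AS id2, TYPE(r) AS r_type"
for record in graph.cypher.stream(cq):
    # Records arrive incrementally, so memory use stays flat
    # even as the relationship count grows into the millions.
    write_edge(record["id1"], record["id2"], record["r_type"])  # hypothetical handler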

Related

How to use the WITH clause for Neo4j Cypher subquery formulation?

I am trying to create a simple Cypher query that finds all instances in the graph roughly matching this structure: (BlogPost A) -> (Term) <- (BlogPost B). That is, I am looking for all pairs of blog posts that are flagged with the same term, and I also want to count the number of shared terms. A term is a mechanism of categorization in this context.
Here is my query proposal:
MATCH (blogA:content {entitySubType:'blog'})
WITH blogA MATCH (blogA) -[]-> (t:term) <-[]- (blogB:content)
WHERE blogB.entitySubType='blog' AND NOT (ID(blogA) = ID(blogB))
RETURN ID(blogA), ID(blogB), count(t) ;
This query ends with null after ~1 day.
Is the usage of blogA in the subquery not possible in the way I am using it? When using the same query with limits, I do get results:
MATCH (blogA:content {entitySubType:'blog'})
WITH blogA
LIMIT 10
MATCH (blogA) -[]-> (t:term) <-[]- (blogB:content)
WHERE blogB.entitySubType='blog' AND NOT (ID(blogA) = ID(blogB))
RETURN ID(blogA), ID(blogB), count(t)
LIMIT 20;
My Neo4j instance has ~500GB RAM, and the whole graph including all properties is ~30GB with ~15 million vertices in total, of which 101k are blog vertices and 108k are terms.
I would be grateful for every hint about possible problems or suggestions for improvements.
Also make sure to consume that query with a client driver (e.g. the Java driver) that can stream the potentially billions of results. Here is a query that uses the compiled runtime, which should be the fastest and most memory-efficient:
MATCH (blogA:Blog)-[:TAGGED]->(t:Term)<-[:TAGGED]-(blogB:Blog)
WHERE blogA <> blogB
RETURN ID(blogA), ID(blogB), count(t);
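The answer mentions the Java driver, but the same streaming consumption works from Python. A minimal sketch with the official neo4j Python driver, whose results are consumed lazily as they arrive from the server; the URI, credentials, and handle_pair() callback are assumptions to adapt:
from neo4j import GraphDatabase

# Connection details are assumptions -- adjust for your deployment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (blogA:Blog)-[:TAGGED]->(t:Term)<-[:TAGGED]-(blogB:Blog)
WHERE blogA <> blogB
RETURN ID(blogA) AS id_a, ID(blogB) AS id_b, count(t) AS shared_terms
"""

with driver.session() as session:
    # session.run() returns a lazily-consumed result: records are streamed
    # from the server instead of being buffered in client memory all at once.
    for record in session.run(query):
        handle_pair(record["id_a"], record["id_b"], record["shared_terms"])  # hypothetical

driver.close()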

Neo4j ORDER BY relationship count extremely slow

I'm trying to model a large knowledge graph. (using v3.1.1).
My actual graph contains only two types of nodes (Topic, Properties) and a single type of relationship (HAS_PROPERTIES).
The count of nodes is about 85M (47M :Topic; the rest are :Properties).
I'm trying to get the most connected :Topic node. For this, I'm using the following query:
MATCH (n:Topic)-[r]-()
RETURN n, count(DISTINCT r) AS num
ORDER BY num
This query, and almost any query I try that uses count(relationships) and ORDER BY count(relationships) without filtering the results, is extremely slow: it runs for more than 10 minutes with no response.
Am I missing indexes, or is there better syntax?
Is there any chance I can execute this query in a reasonable time?
Use this:
MATCH (n:Topic)
RETURN n, size( (n)--() ) AS num
ORDER BY num DESC
LIMIT 100
This reads the degree directly from the node, instead of expanding every relationship and counting distinct ones.
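A variant not in the original answer: because all relationships here are :HAS_PROPERTIES, the same trick works with a typed pattern, e.g. size( (n)-[:HAS_PROPERTIES]-() ). In the query plan this becomes a GetDegree operation, a cheap degree lookup on the node rather than an expand-and-count over all of its relationships.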

How can I optimise my Neo4j Cypher query?

Please check my Cypher below. I get results quickly when there are few records, but as the number of records increases it takes a long time, about 1601152 ms.
I found a suggestion to add USING INDEX, and I applied USING INDEX in the query:
PROFILE MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person)-[:WATCHED]->(ma:Movie)-[:HAS_TAG]->(t:Tag)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
USING INDEX a:App(app_id) WHERE p.person_id = '1'
AND NOT (p:Person)-[:WATCHED]-(mb)
RETURN DISTINCT(mb.movie_id) , mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(DISTINCT(t.tag_id)) as Tag, count(DISTINCT(t.tag_id)) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50
Can you help me out? What can I do?
I am trying to find 100 movies to recommend on the basis of tags: 100 movies that I have not watched and whose tags match the tags of movies I have watched.
The following query may work better for you [assuming you have indexes on both :App(app_id) and :Person(person_id)]. By the way, I presumed that in your query the identifier ma should have been m (or vice versa).
MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person {person_id: '1'})-[:WATCHED]->(m)
WITH a, p, COLLECT(m) AS movies
UNWIND movies AS movie
MATCH (movie)-[:HAS_TAG]->(t)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
WHERE NOT mb IN movies
WITH DISTINCT mb, t
RETURN mb.movie_id, mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(t.tag_id) as Tag, COUNT(t.tag_id) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50;
If you PROFILE this query, you should see that it performs NodeIndexSeek operations (instead of the much slower NodeByLabelScan) to quickly execute the first MATCH. The query also collects all the movies watched by the specified person and uses that collection later to speed up the WHERE clause (which no longer needs to hit the DB). In addition, the query omits labels from some of the node patterns (where doing so seemed unambiguous) to speed up processing further.
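One addition beyond the original answer: if the assumed indexes do not exist yet, they can be created in Neo4j 3.x with CREATE INDEX ON :App(app_id) and CREATE INDEX ON :Person(person_id).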

Optimize Neo4j Cypher query with very large dataset

I'm trying to figure out how to optimize a Cypher query on a very large dataset. I'm trying to find 2nd- or 3rd-degree friends in the same city. My current query, which takes over 1 minute to run, is:
match (n:User {id: 123})-[:LIVES_IN]->()<-[:LIVES_IN]-(u:User), (n)-[:FRIENDS_WITH*2..3]-(u) WHERE u.age >= 20 AND u.age <= 36 return u limit 100
There are approximately 500K User nodes and 500M FRIENDS_WITH relationships. I already have indexes on the id and age properties. The query seems to be choking on the FRIENDS_WITH requirement. Is there any way to approach this differently or optimize the Cypher to make it real-time (i.e., max time 1-2 seconds)?
Here's the profile of the query: [profile screenshot omitted]
Thanks.
Create index on id property for label User:
CREATE INDEX ON :User(id)
See the documentation on schema indexes for more information: http://neo4j.com/docs/stable/query-schema-index.html
If that doesn't help, add the result of a PROFILE run of the query and we might be able to help you more:
PROFILE MATCH ... rest of your query
Also, it might be worth trying to rewrite the query the following way:
MATCH (n:User {id: 123})-[:LIVES_IN]->()<-[:LIVES_IN]-(u:User),
(n)-[:FRIENDS_WITH*2..3]-(u)
WHERE u.age >= 20 AND u.age <= 36
return u limit 100

Cypher performance in a graph with a large number of relationships from one node

I have a Neo4j graph (ver. 2.2.2) with a large number of relationships. For example: 1 "Group" node, 300000 "Data" nodes, and 300000 relationships from "Group" to all existing "Data" nodes. I need to check whether there is a relationship between a set of Data nodes and a specific Group node (for example, for 200 nodes). But the Cypher query I used is very slow. I tried many modifications of this query, with no improvement.
Cypher to create graph:
FOREACH (r IN range(1,300000) | CREATE (:Data {id:r}));
CREATE (:Group);
MATCH (g:Group),(d:Data) CREATE (g)-[:READ]->(d);
Query 1: COST planner. 600003 total db hits in 730 ms.
Acceptable, but I asked for only 1 node.
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000] AND id(g)=300000 RETURN id(d);
Query 2: COST planner. 600003 total db hits in 25793 ms.
Not acceptable.
You need to replace "..." with the actual node ids from 10000 to 10199:
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] AND id(g)=300000 RETURN id(d);
Query 3: COST planner. 1000 total db hits in 309 ms.
This is the only solution I found that makes the query acceptably fast. I return the ids of all "Group" nodes and manually filter the results in my code, keeping only relationships to the node with id 300000.
You need to replace "..." with the actual node ids from 10000 to 10199:
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] RETURN id(d), id(g);
Question 1: The total db hits in query 1 are surprising, but I accept that the physical model of Neo4j defines how this query is executed: it needs to look at every existing relationship of the "Group" node. I accept that. But why is there such a big difference in execution time between query 1 and query 2 if the number of db hits is the same (and the execution plan is the same)? I'm only returning the id of the node, not a large set of properties.
Question 2: Is query 3 the only way to optimize this query?
Apparently there is an issue with Cypher in 2.2.x with seekById.
You can prefix your query with PLANNER RULE in order to use the previous Cypher planner, but you'll have to split your pattern in two to make it really fast. Tested, e.g.:
PLANNER RULE
MATCH (d:Data) WHERE id(d) IN [30]
MATCH (g:Group) WHERE id(g) = 300992
MATCH (d)<-[:READ]-(g)
RETURN id(d)
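The idea, presumably, is that the rule planner resolves each id(...) IN [...] predicate with a direct seek by node id, so the final MATCH only expands from the ~200 seeded Data nodes (each with a single :READ relationship) instead of walking all 300000 relationships of the Group node, which matches the ~1000 db hits seen in query 3.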
