I use the query
"START a=node("+str(node1)+"),
b =node("+str(node2)+")
MATCH p=shortestPath(a-[:cooperate*..200]-b)
RETURN length(p)"
to see the path between a and b. I have many nodes, so when I run the query, sometimes it runs fast and sometimes it runs slowly. I am using Neo4j 1.9 Community. Can anyone help?
Query time is proportional to the amount of the graph searched. Your query allows for very deep searches, up to depth 200. If a and b are very close, you'll not search much of the graph, and the query will return very quickly. If a and b are separated by 200 edges, you will search a very large swathe of the graph (perhaps the whole graph?), which for a large graph will be much slower.
Is the graph changing between the two queries? Is it possible these two nodes end up in different places in relation to each other between the queries, for example if you generate some random data to populate the graph?
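If you don't need paths anywhere near that deep, a tighter upper bound keeps the search local. A minimal sketch in the 1.9 syntax used above, with hypothetical node IDs 123 and 456:
START a=node(123), b=node(456)
MATCH p=shortestPath(a-[:cooperate*..10]-b)
RETURN length(p)
shortestPath still returns the shortest match within the bound; the bound only caps how far the search is allowed to expand.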
I am experimenting with querying Neo4j data whose size gradually increases. Some of my query expressions behave as I would expect - a larger graph means a longer execution time. But, e.g., the query
MATCH (`g`:Graph)
MATCH (`g`)<-[:`Graph`]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:`Graph`]->(`g`)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons
has these execution times (in ms):
80.2, 208.1, 301.7, 399.23, 0.1, 2.07, 2.61, 2.81, 7.3, 1.5
How can the rapid acceleration from the fifth experiment onward be explained? The data are the same, just extended; no indexes were created.
The data size at the 4th experiment: 201673 nodes, 601189 relationships, 859225 properties.
The data size at the 5th experiment: 242113 nodes, 741500 relationships, 1047060 properties.
All I can think of is that maybe Cypher starts using some indexes above a certain data size, but I can't find anywhere whether that's the case.
Thank you for any comments.
Neo4j cache management may explain your observations, but you should describe what you are doing more precisely. What version of Neo4j are you using? What is the Graph node? Are you repeatedly running the same query on a graph and then trying again with a larger or smaller graph?
If you are running the same query multiple times on the same data set and seeing faster execution times, then the cache may be the reason. In v3.5 and earlier it would "warm up" with repeated execution. Putatively this does not occur in v4.x.
You might look at cold start or these tips. You might also check your transaction log: is it accumulating large files?
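One commonly suggested warm-up pattern for the older (pre-4.x) caches is a query that simply touches every node and relationship once; a minimal sketch:
MATCH (n)
OPTIONAL MATCH (n)-[r]->()
RETURN count(n) + count(r)
Running this once after startup pulls the store into memory, so subsequent queries hit a warm cache rather than disk.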
Why the backticks around your node identifiers (`g`)? Just use (g:Graph) and [:Graph]; no backticks are needed for plain identifiers.
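For example, the query from the question can be written without them:
MATCH (g:Graph)
MATCH (g)<-[:Graph]-(person)-[:birthPlace {Value: "http://persons/data/birthPlace"}]->(place)-[:Graph]->(g)
WITH count(person.Value) AS persons, place
WHERE persons > 5
RETURN place.Value AS place, persons
ORDER BY persons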
I am using Neo4j Community Edition embedded in a Java application for recommendation purposes. I made a custom function which contains complex logic for comparing two entities, namely products and users. Both entities are present as nodes in the graph and each has more than 20 properties used for comparison. For example, I am calling this function in the following format:
match (e:User {user_id:"some-id"}) with e
match (f:Product {product_id:"some-id"}) with e,f
return e,f,findComparisonValue(e,f) as pref_value;
This function call takes about 4-5 ms to run on average. Now, to recommend the best products to a particular user, I wrote a Cypher query which iterates over all products, calculates the pref_value and ranks them. My Cypher query looks like this:
MATCH (source:User) WHERE id(source)={id} with source
MATCH (reco:Product) WHERE reco.is_active='t'
with reco, source, findComparisonValue(source, reco) as score_result
RETURN distinct reco, score_result.score as score, score_result.params as params, score_result.matched_keywords as matched_keywords
order by score desc
Some insights on graph structure:
Total Number of nodes: 2 million
Total Number of relationships: 20 million
Total Number of Users: 0.2 million
Total Number of Products: 1.8 million
The above Cypher query is taking more than 10 seconds, as it iterates over all the products. On top of this Cypher query, I am using the graphaware-reco module for my recommendation needs (using precompute, filtering, post-processing, etc.). I thought of parallelising this, but the Community Edition does not support clustering. Now, as the number of users in the system is increasing day by day, I need to think of a scalable solution.
Can anyone help me out here with how to optimize the query?
As others have commented, doing a significant calculation potentially millions of times in a single query is going to be slow, and does not take advantage of Neo4j's strengths. You should investigate modifying your data model and calculation so that you can leverage relationships and/or indexes.
In the meantime, there are a number of things to suggest with your second query:
Make sure you have created an index for :Product(is_active), so that it is not necessary to scan all products. (By the way, if that property is actually supposed to be a boolean, then consider making it a boolean rather than a string.)
The RETURN clause should not need the DISTINCT operator, since every reco value is already distinct, and therefore all the result rows are distinct anyway. Removing that keyword should improve performance. (Both changes are sketched below.)
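A minimal sketch of both suggestions together, using the Neo4j 3.x index syntax (and keeping is_active as a string for now):
CREATE INDEX ON :Product(is_active);

MATCH (source:User) WHERE id(source)={id}
WITH source
MATCH (reco:Product) WHERE reco.is_active='t'
WITH reco, source, findComparisonValue(source, reco) AS score_result
RETURN reco, score_result.score AS score, score_result.params AS params, score_result.matched_keywords AS matched_keywords
ORDER BY score DESC
With the index in place, the planner can jump straight to active products instead of scanning all 1.8 million Product nodes.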
I have a question related to Neo4j.
When I create around 40,000 nodes and their relationships, which are around 20,000 (it means 4K nodes and 6K relationships), the data is less than 15 MB for sure.
When I run the query
"match (n) optional match (n)-[r]-() return n,r: "
it starts to load, and after waiting a long time it returns nothing (in graphical form). The result pane shows how many nodes and relationships I have, but no graph. I want to see the complete graph of my data; is there any way to visualize what it looks like? When I limit the query to 800 it works.
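For reference, the limited variant that does render would look like:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
RETURN n, r
LIMIT 800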
Is there anything I need to change in the settings or in my system memory? Any suggestions?
The web console isn't very good beyond the scale of a few hundred nodes. I'd suggest looking at Gephi:
http://gephi.github.io/
Alternatively you could use Linkurious, an online tool:
https://linkurio.us/
If you want to roll your own there are a number of choices out there. I like Sigma.js:
http://sigmajs.org/
Linkurious also has a library based on Sigma:
https://github.com/Linkurious/linkurious.js
EDIT: http://keylines.com/ is another online service like Linkurious.
I was reading a book recommended on the Neo4j site (http://neo4j.com/books/graph-databases/) about graph database performance, and it said:
"In contrast to relational databases, where join-intensive query performance deteriorates
as the dataset gets bigger, with a graph database performance tends to remain
relatively constant, even as the dataset grows. This is because queries are localized to a
portion of the graph. As a result, the execution time for each query is proportional only
to the size of the part of the graph traversed to satisfy that query, rather than the size of
the overall graph."
So, e.g., I want to return only nodes with the label "Doctor"; that's localized to a portion of the graph. But my question is: how does the database itself know where those nodes are? In other words, does it not need to traverse all nodes to find out whether or not they satisfy the query, and make the decision based on that?
Neo4j has special indexing for node labels so that it can find all nodes with a given label without scanning all nodes. Beyond that you can:
Create your own indexes based on node properties (either schema indexes or legacy indexes) in order to find nodes as starting points (see the sketch after this list)
Query by node IDs to find a starting point (though I'd suggest using your own property with an index if you need to identify nodes more permanently)
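As a minimal sketch of the first option, here is a schema index in the 2.x/3.x syntax, with a hypothetical name property:
CREATE INDEX ON :Doctor(name);

MATCH (d:Doctor {name: "Alice"})
RETURN d
With the index in place, the second query looks up the matching Doctor node directly rather than scanning every node carrying the label.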
In general, localized searches mean you start from a smallish set of starting points, which can be people, products, places, orders, etc.
A portion of the graph that is annotated with a label often doesn't fall into that category; i.e., all doctors are not a smallish set of starting points.
Your query would probably touch a large portion of the graph if you traversed out from all doctors to their neighborhoods.
A query like this would be a graph-local one:
MATCH (:City {name:"SFO"})<-[:RESIDES_IN]-(d:Doctor)-[presc:PRESCRIBES]->(m:Medicine)
RETURN d.name, m.name, sum(presc.amount) as amount
I'm not sure that title is worded very well, but I'm not sure how else to put it. I'm populating a Neo4j database with some data.
The data is mainly generated from data I have about pairs of users. There is a percentage relationship between users, such as:
80
A ---> B
But the reverse relation is not the same:
60
A <--- B
I could put both relations in, but I think what I might do, is use the mean:
70
A <--> B
But I'm open to suggestions here.
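For concreteness, here is a minimal sketch of the two modeling options, with hypothetical relationship type names and a hypothetical id property:
// option 1: keep both directed weights (hypothetical :RATES type)
MATCH (a:User {id: 1}), (b:User {id: 2})
CREATE (a)-[:RATES {percent: 80}]->(b),
       (b)-[:RATES {percent: 60}]->(a);

// option 2: store the mean once (hypothetical :RELATED type)
MATCH (a:User {id: 1}), (b:User {id: 2})
CREATE (a)-[:RELATED {percent: 70}]->(b);
Neo4j stores every relationship with a direction, but a pattern like (a)-[:RELATED]-(b) matches it in either direction.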
What I want to do in Neo4j is get groups of related users. So, for example, I want to get a group of users with a mean % relation > 50. So, if we have:
A
40 / \ 60
B --- C ------ D
20 70
We'd get back a subset, something like:
A
\ 60
C ------ D
70
I have no idea how to do that. Another thing is that I'm pretty sure it wouldn't be possible to reach any node from any other node; I think the graph is disconnected, like several large subgraphs. But I'd like to be able to get everything that falls into the above, even if some groups of nodes are completely separate from other nodes.
As an idea of the numbers, there will be around 100,000 nodes and 550,000 edges.
A couple of thoughts.
First, it's fine if your graph isn't connected, but you need some way to access every component you want to analyze. In a disconnected graph in Neo4j, that either means Lucene indexing, some external "index" that holds node or relationship ids, or iterating through all nodes or relationships in the DB (which is slow, but might be required by your problem).
Second, though I don't know your domain, realize that sticking with the original representation and weights might be fine. Neo doesn't have undirected edges (though you can just ignore edge direction), and you might need that data later. OTOH, your modification taking the mean of the weights does simplify your analysis.
Third, depending on the size and characteristics of your graph, this could be very slow. It sounds like you want all connected components in the subgraph built from the edges with a weight greater than 50. AFAIK, that requires an O(N) operation over your database, where N is the number of edges in the DB. You could iterate over all edges in your DB, filter based on their weight, and then cluster from there.
Using Gremlin/Groovy, you can do this pretty easily. Check out the Gremlin docs.
Another approach might be some sort of iterative clustering as you insert your data. It occurs to me that this could be a significant real-world performance improvement, though it doesn't get around the O(N) necessity.
Maybe something like http://tinyurl.com/c8tbth4 is applicable here?
START a=node(*)
MATCH p=a-[r]-()-[t]-()
WHERE r.percent > 50 AND t.percent > 50
RETURN p, r.percent, t.percent