Neo4j query on modest Graph rather slow - neo4j

I'm toying around with Neo4J. My Data consists of users who own objects which are tagged by tags. There my schema looks like:
(:User)-[:OWNS]->(:Object)-[:TAGGED_AS]->(:Tag)
I have written a script that generates me a sample Graph. Currently I have 100 User, ~2500 Tag and ~10k Object nodes in the database. Between those I have ~700k relationships. I know want to find every Object that is not owned by a certain User but related over a Tag the User has used himself. There query looks like:
MATCH (user:User {username: 'Cristal'})
WITH user
MATCH (user)-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
However, this query runs ~1-5 minutes (depending on the user and how many objects he owns), which is a not only a bit to slow. What am I doing wrong? I consider this a rather "trivial" query against a graph of modest size. I'm using Neo4J 2.1.6 Community and already set the Java Heap to 2000 MB (and I can see that there is a Java process using this much). Am I missing an index or something like that (I'm new to Neo4J)?
I honestly expected the result to be pretty much instant especially considering that the Neo4J docs mention the I should use a heap between 1 and 4 GB for 100 million objects...and I'm only close to a 1/100 of this number.
If it is my Query (which I hope and expect) how can I improve it? What is something you have to be aware when writing queries?

Do you have an index on the username property?
CREATE INDEX ON :User(username)
Also you don't really need that WITH there, so maybe drop it to see if it helps:
MATCH (user:User {username: 'Cristal'})-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
Also, I don't think it will make a different, but you can drop the obj and tag variables since you're not using them elsewhere in the query.
Also, if you're generating sample graphs you may want to check out GraphGen:
http://graphgen.neoxygen.io/

Related

How to determine the Max property on a Relationship in Neo4j 2.2.3

How do you quickly get the maximum (or minimum) value for a property of all instances of a relationship? You can assume the machine I'm running this on is well within the recommended spec's for the cpu and memory size of graph and the heap size is set accordingly.
Facts:
Using Neo4j v2.2.3
Only have access to modify graph via Cypher query language which I'm hitting via PHP or in the web interfacxe--would love to avoid any solution that requires java coding.
I've got a relationship, call it likes that has a single property id that is an integer.
There's about 100 million of these relationships and growing
Every day I grab new likes from a MySQL table to add to the graph within in Neo4j
The relationship property id is actually the primary key (auto incrementing integer) from the raw MySQL table.
I only want to add new likes so before querying MySQL for the new entries I want to get the max id from the likes, so I can use it in my SQL query as SELECT * FROM likes_table WHERE id > max_neo4j_like_property_id
How can I accomplish getting the max id property from neo4j in a optimal way? Please indicate the create statement needed for any index as well as the query you'd used to get the final result.
I've tried creating an index as follows:
CREATE INDEX ON :likes(id);
After the index is online I've tried:
MATCH ()-[r:likes]-() RETURN r.i ORDER BY r.id DESC LIMIT 1
as well as:
MATCH ()-[r:likes]->() RETURN MAX(r.id)
They work but take freaking forever as the explain plan for both indicate no indexes being used.
UPDATE: Holy $?##$?!!!! It looks like the new schema indexes aren't functional for relationships even though you can create them and show them with :schema. It also looks as if there's no way with cypher directly to create Legacy Indexes which look like they might solve this issue.
If you need to query relationship properties, it is generally a sign of a model issue.
The need of this query reveals you that you would better extract these properties into a node, that you'll then be able to query faster.
I don't say it is 100% the case, but certainly 99% of the people seen so far with the same problem has been demonstrating this model concern.
What is your model right now ?
Also you don't use labels at all in your query, likes have a context bound to the nodes.

how to visualize the complete graph in neo4j after importing the data and relationships

I have a question related to the NEO4J
when I create the nodes arround 40,000 and their relationships which are around 20,000 it means, 4K nodes and 6![enter image description here][1]K relationships which are less than 15MB for sure.
when I run a query
"match (n) optional match (n)-[r]-() return n,r: "
it starts to load and after waiting for long time it returns nothing (in graphical form). But in the resultant file it shows how many nodes and relationships I have but no graphs . I want to see the complete graph of my data. is there anyway to see how does it look like, its only to visualize. When I limit the query till 800 it works.
Is there anything I need to change in settings or in my system memory?
any suggestion for that?
The web console isn't very good for more than the hundreds of nodes scale. I'd suggest looking at Gephi:
http://gephi.github.io/
Alternatively you could use Linkurious, an online tool:
https://linkurio.us/
If you want to roll your own there are a number of choices out there. I like Sigma.js:
http://sigmajs.org/
Linkurious also has a library based on Sigma:
https://github.com/Linkurious/linkurious.js
EDIT: http://keylines.com/ is another online service like Linurious

Creating nodes and relationships at the same time in neo4j

I am trying to build an database in Neo4j with a structure that contains seven different types of nodes, in total around 4-5000 nodes and between them around 40000 relationships. The cypher code i am currently using is that i first create the nodes with the code:
Create (node1:type {name:'example1', type:'example2'})
Around 4000 of that example with unique nodes.
Then I've got relationships stated as such:
Create
(node1)-[:r]-(node51),
(node2)-[:r]-(node5),
(node3)-[:r]-(node2);
Around 40000 of such unique relationships.
With smaller scale graphs this has not been any problem at all. But with this one, the Executing query never stops loading.
Any suggestions on how I can make this type of query work? Or what i should do instead?
edit. What I'm trying to build is a big graph over a product, with it's releases, release versions, features etc. in the same way as the Movie graph example is built.
The product has about 6 releases in total, each release has around 20 releaseversion. In total there is 371 features and of there 371 features there is also 438 featureversions. ever releaseversion (120 in total) then has around 2-300 featureversions each. These Featureversions are mapped to its Feature whom has dependencies towards a little bit of everything in the db. I have also involed HW dependencies such as the possible hw to run these Features on, releases on etc. so basicaly im using cypher code such as:
Create (Product1:Product {name:'ABC', type:'Product'})
Create (Release1:Release {name:'12A', type:'Release'})
Create (Release2:Release {name:'13A, type:'release'})
Create (ReleaseVersion1:ReleaseVersion {name:'12.0.1, type:'ReleaseVersion'})
Create (ReleaseVersion2:ReleaseVersion {name:'12.0.2, type:'ReleaseVersion'})
and below those i've structured them up using
Create (Product1)<-[:Is_Version_Of]-(Release1),
(Product1)<-[:Is_Version_Of]-(Release2),
(Release2)<-[:Is_Version_Of]-(ReleaseVersion21),
All the way down to features, and then I've also added dependencies between them such as:
(Feature1)-[:Requires]->(Feature239),
(Feature239)-[:Requires]->(Feature51);
Since i had to find all this information from many different excel-sheets etc, i made the code this way thinking i could just put it together in one mass cypher query and run it on the /browser on the localhost. it worked really good as long as i did not use more than 4-5000 queries at a time. Then it created the entire database in about 5-10 seconds at maximum, but now when I'm trying to run around 45000 queries at the same time it has been running for almost 24 hours, and are still loading and saying "executing query...". I wonder if there is anyway i can improve the time it takes, will the database eventually be created? or can i do some smarter indexes or other things to improve the performance? because by the way my cypher is written now i cannot divide it into pieces since everything in the database has some sort of connection to the product. Do i need to rewrite the code or is there any smooth way around?
You can create multiple nodes and relationships interlinked with a single create statement, like this:
create (a { name: "foo" })-[:HELLO]->(b {name : "bar"}),
(c {name: "Baz"})-[:GOODBYE]->(d {name:"Quux"});
So that's one approach, rather than creating each node individually with a single statement, then each relationship with a single statement.
You can also create multiple relationships from objects by matching first, then creating:
match (a {name: "foo"}), (d {name:"Quux"}) create (a)-[:BLAH]->(d);
Of course you could have multiple match clauses, and multiple create clauses there.
You might try to match a given type of node, and then create all necessary relationships from that type of node. You have enough relationships that this is going to take many queries. Make sure you've indexed the property you're using to match the nodes. As your DB gets big, that's going to be important to permit fast lookup of things you're trying to create new relationships off of.
You haven't specified which query you're running that isn't "stopping loading". Update your question with specifics, and let us know what you've tried, and maybe it's possible to help.
If you have one of the nodes already created then a simple approach would be:
MATCH (n: user {uid: "1"}) CREATE (n) -[r: posted]-> (p: post {pid: "42", title: "Good Night", msg: "Have a nice and peaceful sleep.", author: n.uid});
Here the user node already exists and you have created a new relation and a new post node.
Another interesting approach might be to generate your statements directly in Excel, see http://blog.bruggen.com/2013/05/reloading-my-beergraph-using-in-graph.html?view=sidebar for an example. You can run a lot of CREATE statements in one transaction, so this should not be overly complicated.
If you're able to use the Neo4j 2.1 prerelease milestones, then you should try using the new LOAD CSV and PERIODIC COMMIT features. They are designed for just this kind of use case.
LOAD CSV allows you to describe the structure of your data with one or more Cypher patterns, while providing the values in CSV to avoid duplication.
PERIODIC COMMIT can help make large imports more reliable and also improve performance by reducing the amount of memory that is needed.
It is possible to use a single cypher query to create a new node as well as relate it to an existing now.
As an example, assume you're starting with:
an existing "One" node which has an "id" property "1"
And your goal is to:
create a second node, let's call that "Two", and it should have a property id:"2"
relate the two nodes together
You could achieve that goal using a single Cypher query like this:
MATCH (one:One {id:'1'})
CREATE (one) -[:RELATED_TO]-> (two:Two {id:'2'})

Fastest way to load neo4j DB w/Cypher - how to integrate new subgraphs?

I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships) and, at that point, the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements. They make up 1/2 of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this, it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: What is the most efficient way to insert nodes into a neo4j database using cypher AND http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best practice ways to do this that I'm missing. Any other speed-up ideas?
many thanks
Use MERGE, and do smaller transactions--I've found best results with batches of 50-100 (while doing index lookups). Bigger batches are better when doing CREATE only without MATCH. Also, I recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell--it tends to be a fair bit faster.
Alternatively (might not be applicable to all use cases), keep a local "index" of the node ids you've created. For only 200k items, this should be easy to fit in a normal map/dict of string->long. This will prevent you needing to tax the index on the db, and you can do only node-ID-based lookups and CREATE statements, and create the indexes later.
The load2neo plugin worked well for me. Installation was fast+painless and it has a very cypher-like command structure that easily supports uniqueness requirements. Works with neo4j 2.0 labels.
load2neo install + curl usage example:
http://nigelsmall.com/load2neo
load2neo Geoff syntax:
http://nigelsmall.com/geoff
It is much faster (>>10x) than using Cypher via neo4j-shell.
I wasn't able to get the parameters in Cypher through neo4j-shell working despite trying everything I could find via internet search.

Neo4j 2.0: Indexing array-valued properties with schema indexing

I have nodes with multiple "sourceIds" in one array-valued property called "sourceIds", just because there could be multiple resources a node could be derived from (I'm assembling multiple databases into one Neo4j model).
I want to be able to look up nodes by any of their source IDs. With legacy indexing this was no problem, I would just add a node to the index associated with each element of the sourceIds property array.
Now I wanted to switch to indexing with labels and I'm wondering how that kind of index works here. I can do
CREATE INDEX ON :<label>(sourceIds)
but what does that actually do? I hoped it would just create index entries for each array element, but that doesn't seem to be the case. With
MATCH n:<label> WHERE "testid" in n.sourceIds RETURN n
the query takes between 300ms and 500ms which is too long for an index lookup (other schema indexes work three to five times faster). With
MATCH n:<label> WHERE n.sourceIds="testid" RETURN n
I don't get a result. That's clear because it's an array property but I just gave it a try since it would make sense if array properties would be broken down to their elements for indexing purposes.
So, is there a way to handle array properties with schema indexing or are there plans or will I just have to stick to legacy indexing here? My problem with the legacy Lucene index was that I hit the max number of boolean clauses (1024). Another question thus would be: Can I raise this number? Lucene allows that, but can I do this with the Lucene index used by Neo4j?
Thanks and best regards!
Edit: A bit more elaboration on why I hit the boolean clauses max limit: I need to export specific parts of the database into custom file formats for text processing pipelines. These pipelines use components I cannot (be it for the sake of accessibility or time) change to query Neo4j directly, so I'd rather stay with the defined required file format(s). I do the export via the pattern "give me all IDs in the DB; now, for batches of IDs, query the desired information (e.g. specific paths) from Neo4j and store the results to file". Why I use batches at all? Well, if I don't, things are slowed down significantly via the connection overhead. Thus, large batches are a kind of optimization here.
Schema indexes can only do exact matches right now. Your "testid" in n.sourceIds does not use the index (as shown by your query times). I think there are plans to make this behave better, but I'm waiting for them just as eagerly as you are.
I've actually hit a lower max in the lucene query: 512. If there is a way to increase it I'd love to hear of it. The way I got around it is just doing more than one query if I have one of the rare cases that actually goes over 512 ids. What query are you doing where you need more?

Resources