I am new to Neo4j but have been working with MySQL for many years. I have now created a database in Neo4j with 700,000 users, 800,000 cookbooks, and 1.6M saved recipes.
The structure of the nodes is like this: (:User)-[:CREATED]-(:Cookbook)-[:SAVED]-(:Recipe). All the users and recipes are unique, but one user can have multiple cookbooks and every cookbook can have multiple recipes.
I use an EC2 m3.2xlarge, so it is quite fast, but the performance is very poor. This query:
MATCH (r:Recipe{recipe_id:2987431}) return r;
takes between 300 and 500 ms, while MySQL can execute the equivalent lookup in around 2 ms.
Is this usual or have I configured the server all wrong?
(I have an index on :Recipe(recipe_id) )
Has your index come online yet? If you run :schema in the console, it should list all of the constraints and indexes and whether they have been fully populated and are online and available for use.
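A quick way to check both the index and whether the planner actually uses it (a sketch, assuming 2.x/3.x Cypher syntax; the recipe_id value is the one from the question):

// List all indexes and constraints with their state (POPULATING vs ONLINE)
:schema

// Create the index if it is missing, then let it come online
CREATE INDEX ON :Recipe(recipe_id);

// Inspect the plan: an index-backed lookup shows a NodeIndexSeek
// operator rather than a full label scan
PROFILE MATCH (r:Recipe {recipe_id: 2987431}) RETURN r;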
Our py2neo script ingests abstracts at a rate of about 500,000 a day into Neo4j. For comparison, we ingest 20 million of these abstracts into Solr in one day. We're wondering whether this is the expected ingestion rate for Neo4j, or whether there is something we can do to increase performance.
We've tried combinations of py2neo version 2 and version 3 with Neo4j Enterprise version 2 and 3. With each combination, the ingestion rate remains about the same. We use batches of 1000 abstracts to increase performance. The abstracts average about 400-500 words; we create 5 additional entities with modest properties, then create a relationship between each abstract and the entities. We first ingest the entities and then the relationships (create_unique()) to avoid round trips to the server (no find() or find_one()). We prefer merge() over create() to ensure only one node is created per abstract. We did try create(), and load performance only improved slightly. The bottleneck appears to be on the server side: our script creates the 1000 transactions quickly, then there is an extended delay during the commit, suggesting the slowdown is in the Neo4j server while it processes the transaction.
We require a solution that does not wipe the entire Neo4j database. We intend to ingest multiple data streams in parallel in the future, so the DB must remain stable.
We prefer Python over Java and prefer py2neo's merge()/create() based transactions over direct Cypher queries.
We were hoping Bolt would give us better performance, but currently a Bolt transaction hangs indefinitely with py2neo v3 / Neo4j 3.0.0 RC1. We had one instance of an HTTP transaction hanging as well.
Our Neo4j instances use the default configuration.
Our server is a 2-processor, 12-core Linux host with 32 GB of memory.
Any suggestions on how to increase load performance? It would be grand if we could ingest 20 million abstracts into Neo4j in just a few days.
Our ingestion script shows a transaction rate of 54 entity transactions per second. Note that's 54, not 54K:
$ python3 neo-ingestion-rate.py
Number of batches: 8
Entity transactions per batch: 6144
Merge entities: 2016-04-22 16:31:50.599126
All entities committed: 2016-04-22 16:47:08.480335
Entity transactions per second: 53.5494121750082
Relationship transactions per batch: 5120
Merge unique relationships: 2016-04-22 16:47:08.480408
All relationships committed: 2016-04-22 16:49:38.102694
Number of transactions: 40960
Relationship transactions per second: 273.75593641599323
Thanks.
How about loading via neo4j-shell? I do the majority of my work in R and simply script the import.
Here is a blog post where I outline the approach. You could mirror it in Python.
The basic idea is to take your data, save it to disk, and load it via neo4j-shell, where you execute Cypher scripts that reference those files.
I have found this approach to be helpful when loading larger sets of data. But of course, it all depends on the density of your data, the data model itself, and having the appropriate indexes established.
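A minimal sketch of such a script, assuming the abstracts have been exported to a CSV file named abstracts.csv (a hypothetical name, with hypothetical id and text columns) somewhere the server can read:

// load_abstracts.cql -- run with: neo4j-shell -file load_abstracts.cql
// PERIODIC COMMIT flushes every 10,000 rows so transaction state stays small
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///abstracts.csv' AS row
MERGE (a:Abstract {abstract_id: row.id})
  ON CREATE SET a.text = row.text;

With an index (or uniqueness constraint) on :Abstract(abstract_id), the MERGE lookups stay cheap as the load progresses.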
This blog post explains how to import data in bulk:
https://neo4j.com/blog/bulk-data-import-neo4j-3-0/
They claim to be able to import ~31M nodes and ~78M relationships in about 3 minutes.
They just don't mention the machine this is running on; most likely a cluster.
Still, it shows it should be possible to get a much, much higher ingestion rate than what you observe.
The Python client likely imports one record at a time, when you really want to do bulk inserts.
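For example, rather than one statement per record, a single parameterized statement can take a whole batch as a list and expand it server-side; a sketch, where the batch parameter and the Abstract properties are hypothetical:

// One network round trip per batch; {batch} is a list of maps
// supplied as a query parameter by the client
UNWIND {batch} AS row
MERGE (a:Abstract {abstract_id: row.id})
  ON CREATE SET a.text = row.text;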
I'm toying around with Neo4j. My data consists of users who own objects, which are tagged with tags. My schema looks like:
(:User)-[:OWNS]->(:Object)-[:TAGGED_AS]->(:Tag)
I have written a script that generates a sample graph for me. Currently I have 100 User, ~2,500 Tag, and ~10k Object nodes in the database, with ~700k relationships between them. I now want to find every Object that is not owned by a certain User but is related through a Tag the User has used himself. The query looks like:
MATCH (user:User {username: 'Cristal'})
WITH user
MATCH (user)-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
However, this query runs for ~1-5 minutes (depending on the user and how many objects he owns), which is more than a bit too slow. What am I doing wrong? I consider this a rather "trivial" query against a graph of modest size. I'm using Neo4j 2.1.6 Community and have already set the Java heap to 2000 MB (and I can see that there is a Java process using that much). Am I missing an index or something like that? (I'm new to Neo4j.)
I honestly expected the result to be pretty much instant, especially considering that the Neo4j docs mention I should use a heap between 1 and 4 GB for 100 million objects, and I'm only at about 1/100 of that number.
If it is my query (which I hope and expect), how can I improve it? What do you have to be aware of when writing queries?
Do you have an index on the username property?
CREATE INDEX ON :User(username)
Also, you don't really need that WITH there, so maybe drop it and see if it helps:
MATCH (user:User {username: 'Cristal'})-[:OWNS]->(obj:Object)-[:TAGGED_AS]->(tag:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20
Also, I don't think it will make a difference, but you can drop the obj and tag variables since you're not using them elsewhere in the query, as in the sketch below.
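The same query with the unused variables replaced by anonymous patterns:

MATCH (user:User {username: 'Cristal'})-[:OWNS]->(:Object)-[:TAGGED_AS]->(:Tag)<-[:TAGGED_AS]-(other:Object)
WHERE NOT (user)-[:OWNS]->(other)
RETURN other
LIMIT 20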
Also, if you're generating sample graphs you may want to check out GraphGen:
http://graphgen.neoxygen.io/
I'm new to Neo4j's Cypher query language. I'm discovering it while analyzing a graph of person-to-person relationships coming from a CRM system. I'm using Neo4j 2.1.2 Community Edition with Oracle Java JDK 1.7.0_45 on Windows 7 Enterprise, and interacting with Neo4j through the web interface.
One thing puzzles me: I noticed that the result sets of some of my queries grow over time. That is, if I run the same query after having used the database for quite a while (1 or 2 hours later), I get a few more results the second time, despite not having updated, deleted, or added anything in the database.
Is that possible? Are there special cases where it could happen? I would expect the database results to be consistent over time, as long as there is no change to the database.
It feels as if the database were growing its indexes in the background over time, and as if the query results depended on the engine's ability to reach more nodes and relationships through the grown indexes. Could it be a memory or index configuration issue? Or did I possibly get too much coffee? Alas, it is not easily reproducible.
Sample query:
MATCH (pf:Portfolio)<-[:withRelation]-(p1:Partner)-[:JOINTACC]->(p2:Partner)
WHERE (pf.dateBoucl = '') AND (pf.catClient = 'NO')
AND NOT (p2)-[:relTo]->(:Partner)
MATCH (p1)-[r]->(p3:Partner)
WHERE NOT (p3)-[:relTo]->(:Partner)
AND NOT TYPE( r) IN [ 'relTo', 'ADRESSAT', 'MEMBER']
WITH pf, p1, p2, COLLECT( TYPE( r)) AS types
WHERE ALL( t IN types WHERE t = 'JOINTACC')
RETURN pf.catClient, pf.natureTitulaire, COUNT( DISTINCT pf);
At first I got 98 results. Running it 2 hours later, I got 103 results, and it then seemed stable across subsequent runs. And I'm pretty sure I did not change the database contents.
Any hints are very much appreciated! Kind regards
Schema looks like this:
:schema
Indexes
ON :Country(ID) ONLINE (for uniqueness constraint)
ON :Partner(partnerID) ONLINE (for uniqueness constraint)
ON :Portfolio(partnerID) ONLINE
ON :Portfolio(noCli) ONLINE
ON :Portfolio(noDos) ONLINE
Constraints
ON (partner:Partner) ASSERT partner.partnerID IS UNIQUE
ON (country:Country) ASSERT country.ID IS UNIQUE
Dump / download your query results from both runs and do a diff on them. Then you can see what differs and investigate where it came from.
Perhaps you should also update to 2.1.3, which has a caching problem resolved that could be related to this.
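To make the two dumps directly comparable, give the output a deterministic order; here is a sketch of the original query with an ORDER BY appended (the nbPortfolios alias is just for readability):

MATCH (pf:Portfolio)<-[:withRelation]-(p1:Partner)-[:JOINTACC]->(p2:Partner)
WHERE (pf.dateBoucl = '') AND (pf.catClient = 'NO')
  AND NOT (p2)-[:relTo]->(:Partner)
MATCH (p1)-[r]->(p3:Partner)
WHERE NOT (p3)-[:relTo]->(:Partner)
  AND NOT TYPE(r) IN ['relTo', 'ADRESSAT', 'MEMBER']
WITH pf, p1, p2, COLLECT(TYPE(r)) AS types
WHERE ALL(t IN types WHERE t = 'JOINTACC')
RETURN pf.catClient, pf.natureTitulaire, COUNT(DISTINCT pf) AS nbPortfolios
ORDER BY pf.catClient, pf.natureTitulaire;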
I am new to using Neo4j and have set up a test graph DB in Neo4j for organizing some clickstream data, with a very small subset of what we actually use on a day-to-day basis. The graph has about 23 million nodes and 34 million relationships. The queries seem to take forever to run, i.e., I haven't seen a response come back even after waiting for more than 30 minutes.
The data is organized as Year->Month->Day->Session{1..n}->Event{1..n}
I am running the DB on a Windows 7 machine with 1.5 GB of heap allocated to the Neo4j server.
These are the configurations in the neo4j-wrapper.conf
wrapper.java.additional.1=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional.2=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional.3=-Dlog4j.configuration=file:conf/log4j.properties
wrapper.java.additional.6=-XX:+UseParNewGC
wrapper.java.additional.7=-XX:+UseConcMarkSweepGC
wrapper.java.additional.8=-Xloggc:data/log/neo4j-gc.log
wrapper.java.initmemory=1500
wrapper.java.maxmemory=1500
This is what my query looks like:
START n=node(3)
MATCH (n)-[:HAS]->(s)
WITH distinct s
MATCH (s)-[:HAS]->(e) WHERE e.page_name = 'Login'
WITH s.session_id as session, e
MATCH (e)-[:FOLLOWEDBY*0..1]->(e1)
WITH count(session) as session_cnt, e.page_name as startPage, e1.page_name as nextPage
RETURN startPage, nextPage, session_cnt
Also, I have these properties set:
node_auto_indexing=true
node_keys_indexable=name,page_name,geo_country
relationship_auto_indexing=true
Can anyone help me figure out what might be wrong?
Even when I run portions of the query, it takes 10-15 minutes before I see a response.
Note: I have no other applications running on the Windows Machine
Why would you want to return all the nodes in the first place?
If you really want to do that, use the transactional HTTP endpoint and curl to stream the response.
I tested it with a database of 100k nodes. It takes 0.9 seconds to transfer them (1.5MB) over the wire.
If you transfer all their properties by using "return n", it takes 1.4 seconds and results in 4.1MB transferred.
If you just want to know how many nodes are in your DB, use something like this instead:
match (n) return count(*);
I have a MySQL table with ~1 million records.
I will soon need to add search to my Rails 3.x app, and I want the search to be fuzzy.
Currently I use a plugin (rails-fuzzy-search) for another table, but that one has only 3,000 records.
The plugin creates trigrams in another table (25,000 trigrams for the 3,000-record table).
I can't use this method for my 1-million-record table, or my trigrams table might end up with 100 million records!
I see some gems:
https://github.com/seamusabshere/fuzzy_match
https://github.com/kiyoka/fuzzy-string-match
Or the use of Sphinx and Thinking Sphinx + addons.
I don't know which solution gives the best performance.
The search will cover two fields of my table.
Some searching around revealed the fuzzily gem:
Anecdotal benchmark: against our whole Geonames-derived table of locations (3.2M records, about 1GB of data), on my development machine (a 2011 MacBook Pro):
- searching for the top 10 matching records takes 6ms ±1
- preparing the index for all records takes about 10min
- the DB query overhead when changing a record is 3ms ±2
- the memory overhead (footprint of the trigrams table index) is about 300MB
Also, check out Solr and Sunspot.