From what I can tell, my Neo4j v2.3 Community JVM keeps adding items to the Old Gen heap and is never able to garbage collect them.
Here is a detailed outline of the situation.
I have a PHP file which calls the Dropbox Delta API and writes the file structure to my Neo4j database. Each call to Delta returns a data set of 2000 items, from which I pull out the information I need. The following is an example of what that query looks like with just one item; usually I send full batches of 2000 items, as that gave me the best results.
***Following is an example Query***
MERGE (c:Cloud { type:'Dropbox', id_user:'15', id_account:''})
WITH c
UNWIND [
{ parent_shared_folder_id:488417928, rev:'15e1d1caa88',.......}
]
AS items MERGE (i:Item { id:items.path, id_account:'', id_user:'15', type:'Dropbox' })
ON Create SET i = { id:items.path, id_account:'', id_user:'15', is_dir:items.is_dir, name:items.name, description:items.description, size:items.size, created_at:items.created_at, modified:items.modified, processed:1446769779, type:'Dropbox'}
ON Match SET i+= { id:items.path, id_account:'', id_user:'15', is_dir:items.is_dir, name:items.name, description:items.description, size:items.size, created_at:items.created_at, modified:items.modified, processed:1446769779, type:'Dropbox'}
MERGE (p:Item {id_user:'15', id:items.parentPath, id_account:'', type:'Dropbox'})
MERGE (p)-[:Contains]->(i)
MERGE (c)-[:Owns]->(i)
***The query is sent via Everyman***
static function makeQuery($client, $qry) {
return new Everyman\Neo4j\Cypher\Query($client, $qry);
}
This works fine and generally from start to finish takes 8-10 seconds to run.
The Dropbox account I am accessing contains around 35000 items, and takes around 18 runs of my PHP to populate my Neo4j Database with the folder/file structure of the dropbox account.
With every run of this PHP, around 50 MB of items are added to the Neo4j JVM Old Gen heap, and 30 MB of that is not removed by GC.
The end result is obviously the VM runs out of memory and gets stuck in a constant state of GC throttling.
I have tried a range of Neo4j VM settings, as well as an update from Neo4j v2.2.5 to v2.3, which actually appears to have made the problem worse.
My current settings are as follows,
-server
-Xms4096m
-Xmx4096m
-XX:NewSize=3072m
-XX:MaxNewSize=3072m
-XX:SurvivorRatio=1
I am testing on a Windows 10 PC with 8 GB of RAM and a 2.5 GHz quad-core i5, running Java 1.8.0_60.
Any info on how to solve this issue would be greatly appreciated.
Cheers, Jack.
Reduce the new size to 1024m.
Change your settings to:
-server
-Xms4096m
-Xmx4096m
-XX:NewSize=1024m
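As a rough back-of-the-envelope check on why a smaller new size gives more headroom: whatever part of the heap is not young generation is left for the old generation. The sketch below uses the heap and NewSize figures from the two sets of settings and the 30 MB per run from the question. It assumes, as a simplification, that the young generation stays at exactly NewSize; the function names are made up for illustration.

```python
def old_gen_mb(heap_mb, new_size_mb):
    """Approximate old generation size: total heap minus young generation."""
    return heap_mb - new_size_mb


def runs_until_full(old_mb, retained_per_run_mb):
    """How many runs fit before retained data fills the old generation."""
    return old_mb // retained_per_run_mb


RETAINED_PER_RUN_MB = 30  # figure reported in the question

original_old = old_gen_mb(4096, 3072)   # settings from the question
suggested_old = old_gen_mb(4096, 1024)  # settings suggested above

print(original_old, runs_until_full(original_old, RETAINED_PER_RUN_MB))
print(suggested_old, runs_until_full(suggested_old, RETAINED_PER_RUN_MB))
```

This only buys headroom; if data really is retained on every run, the old generation still fills up eventually, which is why the transaction size matters as well.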
It is most likely that your transaction grows too large.
I recommend sending each of the parents in separately: instead of one big UNWIND over everything, send one statement per parent.
Make sure to use the new transactional HTTP endpoint. I recommend going with neoclient instead of Neo4jPHP.
You should also use parameters instead of literal values!
And don't repeat the user id, type, etc. properties on every item.
Are you sure you want to connect everything to c not just the root of the directory structure? I would do the latter.
MERGE (c:Cloud:Dropbox { id_user:{userId}})
MERGE (p:Item:Dropbox {id:{parentPath}})
// owning the parent should be good enough
MERGE (c)-[:Owns]->(p)
// carry p forward so the Contains relationship attaches to the parent
WITH p
UNWIND {items} as item
MERGE (i:Item:Dropbox { id:item.path})
ON Create SET i += { is_dir:item.is_dir, name:item.name, created_at:item.created_at }
SET i += { description:item.description, size:item.size, modified:item.modified, processed:timestamp()}
MERGE (p)-[:Contains]->(i);
Make sure to use 2.3.0 for best MERGE performance for relationships.
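To make the "one statement per parent, with parameters" idea concrete, here is a minimal client-side sketch of the grouping. It only builds the JSON bodies you would POST to the transactional endpoint (/db/data/transaction/commit); it does not send anything. The function names, the shortened Cypher text and the item keys (path, parentPath, name, size) are illustrative assumptions, not what your PHP code or neoclient actually uses.

```python
import json
from collections import defaultdict

# Shortened Cypher for one parent folder; {userId}, {parentPath} and {items}
# are parameters, so no literal values are embedded in the query text.
STATEMENT = (
    "MERGE (c:Cloud:Dropbox {id_user: {userId}}) "
    "MERGE (p:Item:Dropbox {id: {parentPath}}) "
    "MERGE (c)-[:Owns]->(p) "
    "WITH p "
    "UNWIND {items} AS item "
    "MERGE (i:Item:Dropbox {id: item.path}) "
    "SET i += {name: item.name, size: item.size} "
    "MERGE (p)-[:Contains]->(i)"
)


def group_by_parent(items):
    """Group Delta entries by their parent path (assumed key: 'parentPath')."""
    groups = defaultdict(list)
    for item in items:
        groups[item["parentPath"]].append(item)
    return dict(groups)


def build_payloads(user_id, items):
    """Return one request body per parent folder, each a small transaction."""
    payloads = []
    for parent_path, children in sorted(group_by_parent(items).items()):
        payloads.append({
            "statements": [{
                "statement": STATEMENT,
                "parameters": {
                    "userId": user_id,
                    "parentPath": parent_path,
                    "items": children,
                },
            }]
        })
    return payloads


if __name__ == "__main__":
    sample = [
        {"path": "/a/x.txt", "parentPath": "/a", "name": "x.txt", "size": 1},
        {"path": "/a/y.txt", "parentPath": "/a", "name": "y.txt", "size": 2},
        {"path": "/b/z.txt", "parentPath": "/b", "name": "z.txt", "size": 3},
    ]
    for body in build_payloads("15", sample):
        print(json.dumps(body)[:80])
```

Each body is a small, separate transaction, so no single request has to hold all 2000 items in one transaction state.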
I want to run graphs/futures on my distributed cluster which all have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this:
from dask.distributed import Client
client = Client(scheduler_ip)
load_data_future = client.submit(load_data_func, 'path/to/data/')
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Running this as above the scheduler gets one worker to read the file, then it spills that data to disk to share it with the other workers. However, loading the data is usually reading from a large HDF5 file, which can be done concurrently, so I was wondering if there was a way to force all workers to read this file concurrently (they all compute the root task) instead of having them wait for one worker to finish then slowly transferring the data from that worker.
I know there is the client.run() method which I can use to get all workers to read the file concurrently, but how would you then get the data you've read to feed into the downstream tasks?
I cannot use the dask data primitives to concurrently read HDF5 files because I need things like multi-indexes and grouping on multiple columns.
Revisited this question and found a relatively simple solution, though it uses internal API methods and involves a blocking call to client.run(). Using the same variables as in the question:
from distributed import get_worker
client_id = client.id
def load_dataset():
    worker = get_worker()
    data = {'load_dataset-0': load_data_func('path/to/data')}
    info = worker.update_data(data=data, report=False)
    worker.scheduler.update_data(who_has={key: [worker.address] for key in data},
                                 nbytes=info['nbytes'], client=client_id)
client.run(load_dataset)
Now if you run client.has_what() you should see that each worker holds the key load_dataset-0. To use this in downstream computations you can simply create a future for the key:
from distributed import Future
load_data_future = Future('load_dataset-0', client=client)
and this can be used with client.compute() or dask.delayed as usual. Indeed the final line from the example in the question would work fine:
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Bear in mind that it uses internal API methods Worker.update_data and Scheduler.update_data and works fine as of distributed.__version__ == 1.21.6 but could be subject to change in future releases.
As of today (distributed.__version__ == 1.20.2) what you ask for is not possible. The closest thing would be to compute once and then replicate the data explicitly
from dask.distributed import wait
future = client.submit(load, path)
wait(future)
client.replicate(future)
You may want to raise this as a feature request at https://github.com/dask/distributed/issues/new
I've watched Nicole White's awesome YouTube video "Using LOAD CSV in the Real World" and decided to re-create the Neo4j data using the same method.
I've cloned her git repo on this subject and have been working through this example on the community version of Neo4j on my Mac.
I'm stepping through the load.cql file one command at a time, pasting each command into the command window.
Things are going pretty well; I've got a bunch of nodes created. To deal with null values for sub_products and sub_issues in the master file, I created two other CSV files, sub_issues.csv and sub_products.csv, as described in the video.
But when I try reading either of these files, I get "(no changes, no rows)", which gives me the impression that something is wrong.
Below is the actual command sequence I used for the incremental read.
// Load.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:///Volumes/microSD/neo4j-complaints/sub_issue.csv' AS line
WITH line
WHERE line.`Sub-issue` <> '' AND
line.`Sub-issue` IS NOT NULL
MATCH (complaint:Complaint { id: TOINT(line.`Complaint ID`) })
MATCH (complaint)-[:WITH]->(issue:Issue)
MERGE (subIssue:SubIssue { name: UPPER(line.`Sub-issue`) })
MERGE (subIssue)-[:IN_CATEGORY]->(issue)
CREATE (complaint)-[:WITH]->(subIssue)
;
Strip out some of the later statements and do a "RETURN identifier1, identifier2" etc. to see what the engine is doing.
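It can also help to check, outside Neo4j, that the file really contains what the WITH/WHERE part of the query expects. The sketch below is a small stand-alone check and not part of the original load script. It reads a CSV and reports whether the `Sub-issue` and `Complaint ID` headers are present and how many rows have a non-empty `Sub-issue`. The function name is made up for illustration.

```python
import csv
import io


def count_usable_rows(csv_file, column="Sub-issue", id_column="Complaint ID"):
    """Report missing headers and how many rows survive the WHERE filter."""
    reader = csv.DictReader(csv_file)
    headers = reader.fieldnames or []
    missing = [name for name in (column, id_column) if name not in headers]
    if missing:
        return {"missing_headers": missing, "total": 0, "usable": 0}

    total = 0
    usable = 0
    for row in reader:
        total += 1
        # Mirrors: line.`Sub-issue` <> '' AND line.`Sub-issue` IS NOT NULL
        if row[column]:
            usable += 1
    return {"missing_headers": [], "total": total, "usable": usable}


if __name__ == "__main__":
    sample = io.StringIO(
        "Complaint ID,Sub-issue\n"
        "1,Late fee\n"
        "2,\n"
        "3,Billing dispute\n"
    )
    print(count_usable_rows(sample))
```

If this reports zero usable rows or a missing header, the problem is in the file (or the path to it) rather than in the MATCH clauses. If the rows look fine, the MATCH on Complaint or Issue is the more likely place where rows are being dropped.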
I have an XQuery query intended to wipe test documents from the database before each test that is run. Essentially it looks for a certain element (called 'forTestOnly') to be present as a top-level element in a document and, if it finds it, deletes the document. This query is run before each test to ensure the tests don't interfere with one another (we have about 200 tests using this). The exact XQuery is as follows:
xquery version "1.0-ml";
import module namespace dls = "http://marklogic.com/xdmp/dls" at "/MarkLogic/dls.xqy";
let $deleteNonManagedDocs := for $testDoc in /*[forTestOnly]
  let $testDocUri := fn:base-uri($testDoc)
  where fn:not(dls:document-is-managed($testDocUri))
  return xdmp:document-delete($testDocUri)
let $deleteManagedDocs := for $testDoc in cts:search(/*[forTestOnly], dls:documents-query())
  let $testDocUri := fn:base-uri($testDoc)
  return dls:document-delete($testDocUri, fn:false(), fn:false())
return ($deleteManagedDocs, $deleteNonManagedDocs)
While it seems to work fine most of the time, it has recently begun to sporadically spiral out of control. At some point during the test execution it begins to run for a near-indefinite amount of time (I usually stop it after 600-700 seconds), although most of the time it takes less than a second. The database used for testing is not large (it has a few basic seed documents but nothing compared to a production database), and typically each test only creates a handful of documents with the 'forTestOnly' element (if not fewer).
The query seems simple enough, and although running it 200 times in relatively quick succession would understandably put a strain on the database I can't imagine it would cause this kind of lagging (the tests are Grails integration tests and the entire execution takes a little over two minutes). Any ideas why the long run time?
As a side note I have verified that when the tests stall it is indeed after the XQuery has begun to run and not before in some sort of test wiring/execution.
Any help is greatly appreciated.
The query might look simple, but it isn't necessarily simple to evaluate. Those dls function calls could be doing anything, so it's tricky to estimate the complexity. The use of DLS also means that we don't know how much version history has to be deleted to delete each document.
One possibility is that you've discovered a bug. It might already be fixed, which is a good reason why you should always report the full version of the software you're using. The answer may be as simple as upgrading to pick up the fix.
Another possibility is that your test suite ends up running all of this work in a single high-level evaluation, so everything's in memory until the end. That could use enough memory to drive the server into swap. That would explain the recent "spiral out of control" behavior. Check the OS and see what it says.
Next, set the group file-log-level=Debug and check ErrorLog.txt while one of these slow events is happening. If you see XDMP-DEADLOCK messages, you may have a problem where two or more copies of this delete query are running at the same time. MarkLogic has automatic deadlock detection and resolution, but it's faster to avoid the deadlock in the first place.
Some logging might also help determine where the time is spent. Something like:
let $deleteNonManagedDocs := for $testDoc in /*[forTestOnly]
  let $testDocUri := fn:base-uri($testDoc)
  where fn:not(dls:document-is-managed($testDocUri))
  return (
    xdmp:log(text { 'unmanaged', $testDocUri }),
    xdmp:document-delete($testDocUri))
let $deleteManagedDocs := for $testDoc in cts:search(/*[forTestOnly], dls:documents-query())
  let $testDocUri := fn:base-uri($testDoc)
  let $_ := xdmp:log(text { 'managed', $testDocUri })
  return dls:document-delete($testDocUri, fn:false(), fn:false())
return ()
Finally you should also be able to simplify the query a bit. Since you're deleting everything, you can just ignore DLS.
xdmp:document-delete(
  cts:uris(
    (), (),
    cts:element-query(xs:QName('forTestOnly'), cts:and-query(()))))
This would be even simpler and more efficient if you set a collection on every test document: xdmp:collection-delete('test-docs').
I'm building an application where my users can manage dictionaries. One feature is uploading a file to initialize or update the dictionary's content.
The part of the structure I'm focusing on for a start is Dictionary -[:CONTAINS]->Word.
Starting from an empty database (Neo4j 1.9.4, but I also tried 2.0.0M5), accessed via Spring Data Neo4j 2.3.1 in a distributed environment (therefore using SpringRestGraphDatabase, but testing with localhost), I'm trying to load 7k words into 1 dictionary. However, I can't get it done in less than 8 to 9 minutes on a Linux machine with a Core i7, 8 GB RAM and an SSD drive (ulimit raised to 40000).
I've read lots of posts about loading/inserting performance using REST and I've tried to apply the advice I found, but without any luck. The BatchInserter tool doesn't seem to be a good option for me due to my application constraints.
Can I hope to load 10k nodes in a matter of seconds rather than minutes?
Here is the code I came up with, after all my reading:
Map<String, Object> dicProps = new HashMap<String, Object>();
dicProps.put("locale", locale);
dicProps.put("category", category);
Dictionary dictionary = template.createNodeAs(Dictionary.class, dicProps);
Map<String, Object> wordProps = new HashMap<String, Object>();
Set<Word> words = readFile(filename);
for (Word gw : words) {
    wordProps.put("txt", gw.getTxt());
    Word w = template.createNodeAs(Word.class, wordProps);
    template.createRelationshipBetween(dictionary, w, Contains.class, "CONTAINS", true);
}
I solved this kind of problem by creating a CSV file and then reading it from Neo4j. The steps are:
Write a class that takes the input data and creates a CSV file from it (it can be one file per node kind, or you can even create a file that is used to build relationships).
In my case I also created a servlet that allows Neo4j to read that file over HTTP.
Create the Cypher statements that read and parse that CSV file. Here are some samples that I use (if you use Spring Data, also remember about the labels):
simple one:
load csv with headers from {fileUrl} as line
merge (:UserProfile:_UserProfile {email: line.email})
more complicated:
load csv with headers from {fileUrl} as line
match (c:Calendar {calendarId: line.calendarId})
merge (a:Activity:_Activity {eventId: line.eventId})
on create set a.eventSummary = line.eventSummary,
a.eventDescription = line.eventDescription,
a.eventStartDateTime = toInt(line.eventStartDateTime),
a.eventEndDateTime = toInt(line.eventEndDateTime),
a.eventCreated = toInt(line.eventCreated),
a.recurringId = line.recurringId
merge (a)-[r:EXPORTED_FROM]->(c)
return count(r)
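For the first step (turning the input data into a CSV file), a minimal sketch could look like the following. It writes one word per row under a `txt` header, matching the property name used in the question. The function name is an assumption for illustration, and how the file is then served to Neo4j (a servlet in my case) is left out.

```python
import csv
import io


def write_words_csv(words, out_file):
    """Write a header row plus one row per word; return the number of words."""
    writer = csv.writer(out_file)
    writer.writerow(["txt"])  # header that LOAD CSV WITH HEADERS will expose as line.txt
    count = 0
    for word in words:
        writer.writerow([word])
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    written = write_words_csv(["apple", "banana", "cherry"], buffer)
    print(written)
    print(buffer.getvalue())
```

The csv module takes care of quoting, so words containing commas or quotes still end up in a single column.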
Try the following:
Use the native Neo4j API rather than spring-data-neo4j when performing batch operations.
Commit in batches, e.g. every 500 words.
NOTE: There are certain properties (type) added by SDN which will be missing when using the native approach.
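The batching idea itself is independent of the API used. As a language-neutral sketch (written in Python here rather than Java), the loop below splits the words into groups of 500 and hands each group to a commit callback. The `commit_batch` callback is hypothetical and stands for "open a transaction, create the nodes and relationships for this group, commit".

```python
def batches(items, size=500):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch


def load_in_batches(words, commit_batch, size=500):
    """Call commit_batch once per group and return how many words were handled."""
    committed = 0
    for batch in batches(words, size):
        commit_batch(batch)  # hypothetical: one transaction per batch
        committed += len(batch)
    return committed


if __name__ == "__main__":
    sizes = []
    total = load_in_batches(["w%d" % i for i in range(7000)],
                            lambda b: sizes.append(len(b)))
    print(total, len(sizes))
```

With 7k words this results in 14 commits instead of 7000 separate round trips.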
I'm trying to use the Java API for Neo4j but I seem to be stuck at IndexHits. If I query the DB with Cypher using
START n=node:types(type="Process") RETURN n;
I get all 2087 nodes of type "Process".
In my application I have the following lines
Index<Node> nodeIndex = db.index().forNodes("types");
IndexHits<Node> hits = nodeIndex.get("type", "Process");
System.out.println("Node index size: " + hits.size());
which leads my console to spit out a value of 0. Here, db is of course an instance of GraphDatabaseService.
I expected an object that included all 2087 nodes. What am I doing wrong?
The .size() question is just the prelude to my iterator
for(Node process : hits) { ... }
but that does not do much when hits.size() == 0. According to http://api.neo4j.org/1.9.2/org/neo4j/graphdb/index/IndexHits.html this should be possible, provided there is something in hits.
Thanks in advance for your help.
I figured it out. Man, I feel so embarrassed...
It so happens that I had set up the DB_PATH to my default data folder, whereas the default storage folder is the default data folder plus graph.db. When I tried to run the code from that corrected DB_PATH I got an error saying that a lock file was in place because the Neo4j server was running. After shutting it down it worked perfectly.
So, if you happen to see the following error, just stop the server and run the code again:
Caused by: org.neo4j.kernel.StoreLockException: Could not create lock file
at org.neo4j.kernel.StoreLocker.checkLock(StoreLocker.java:74)
at org.neo4j.kernel.StoreLockerLifecycleAdapter.start(StoreLockerLifecycleAdapter.java:40)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:491)
I found on several forums that you cannot run the Neo4j server and use the embedded Java API against the same store at the same time.