Neo4j: Inserting 7k nodes is slow (Spring Data Neo4j / SpringRestGraphDatabase) - neo4j

I'm building an application where my users can manage dictionaries. One feature is uploading a file to initialize or update the dictionary's content.
The part of the structure I'm focusing on for a start is Dictionary -[:CONTAINS]->Word.
Starting from an empty database (Neo4j 1.9.4, but also tried 2.0.0M5), accessed via Spring Data Neo4j 2.3.1 in a distributed environment (therefore using SpringRestGraphDatabase, but testing with localhost), I'm trying to load 7k words in 1 dictionary. However I can't get it done in less than 8/9 minutes on a linux with core i7, 8Gb RAM and SSD drive (ulimit raised to 40000).
I've read lots of posts about loading/inserting performance using REST and I've tried to apply the advices I found but without better luck. The BatchInserter tool doesn't seem to be a good option to me due to my application constraints.
Can I hope to load 10k nodes in a matter of seconds rather than minutes ?
Here is the code I came up with, after all my readings :
Map<String, Object> dicProps = new HashMap<String, Object>();
dicProps.put("locale", locale);
dicProps.put("category", category);
Dictionary dictionary = template.createNodeAs(Dictionary.class, dicProps);
Map<String, Object> wordProps = new HashMap<String, Object>();
Set<Word> words = readFile(filename);
for (Word gw : words) {
wordProps.put("txt", gw.getTxt());
Word w = template.createNodeAs(Word.class, wordProps);
template.createRelationshipBetween(dictionary, w, Contains.class, "CONTAINS", true);
}

I resolve such problem by just creating some CSV file and after that read it from Neo4j. It is needed to make such steps:
Write some class which get input data and base on it create CSV file (it can be one file per node kind or even you can create file which will be used to build relation).
In my case I have also create servlet which allow Neo4j to read that file by HTTP.
Create proper Cypher statements which allow to read and parse that CSV file. There are some samples of which I use (if you use Spring Data also remember about labels):
simple one:
load csv with headers from {fileUrl} as line
merge (:UserProfile:_UserProfile {email: line.email})
more complicated:
load csv with headers from {fileUrl} as line
match (c:Calendar {calendarId: line.calendarId})
merge (a:Activity:_Activity {eventId: line.eventId})
on create set a.eventSummary = line.eventSummary,
a.eventDescription = line.eventDescription,
a.eventStartDateTime = toInt(line.eventStartDateTime),
a.eventEndDateTime = toInt(line.eventEndDateTime),
a.eventCreated = toInt(line.eventCreated),
a.recurringId = line.recurringId
merge (a)-[r:EXPORTED_FROM]->c
return count(r)

Try the below
Usw native Neo4j API rather than spring-data-neo4j while performing batch operations.
Commit in batches i.e. may be for each 500 words
NOTE: There are certain properties (type) added by SDN which will be missing when using the native approach.

Related

databricks: reading from datawarehouse temp directory

I am writing a function like the following
def fromdw():
df=spark.read \
.format("com.databricks.spark.sqldw")\
.option("url", myurl) \
.option("query",sqlquery)\
.option( "forward_spark_azure_storage_credentials","True")\
.option("tempdir", mytempurl)\
.load()
return df
may I know the option tempdir is compulsory? i want to do a read without tempdir , because it was slow as it require to stage the results in a Blob folder first.
Yes. Use JDBC for smaller, faster queries. Use the Databricks spark connector for SQL DW (Synape) for moving large data sets. This connector bypasses both the Spark Driver node and the Synapse head node, and by using Azure Storage as a staging area, allows the unload and load to be performed in parallel by the clusters in either direction.
JDBC is documented here, and looks like this:
connectionProperties = {
"user" : user,
"password" : pw,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
pushdown_query = "(select * from whatever) q"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
(notice how you have to wrap the SELECT to use a query instead of a table name).

In pyspark, reading csv files gets failed if even 1 path does not exist. How can we avoid this?

In pyspark reading csv files from different paths gets failed if even one path does not exist.
Logs = spark.read.load(Logpaths, format="csv", schema=logsSchema, header="true", mode="DROPMALFORMED");
Here Logpaths is an array that contain multiple paths. And these paths are created dynamically depending upon given startDate and endDate range. If Logpaths contain 5 paths and first 3 exists but 4th does not exist. Then whole extraction gets failed. How can I avoid this in pyspark or how can I check there existance before reading?
In scala I did this by checking file existance and filter out non-existed records by using hadoop hdfs filesystem globStatus function.
Path = '/bilal/2018.12.16/logs.csv'
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
val fileStatus = fs.globStatus(new org.apache.hadoop.fs.Path(Path));
So I got what I was looking for. Like the code I posted in the question which can be used in scala for file existance check. We can use below code in case of PySpark.
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("bilal/logs/log.csv"))
This is exactly the same code also used in scala, so in this case we are using java library for hadoop and java code runs on JVM on which spark is running.

cypher statement returns (no changes, no rows)

I've watched Nicole White's awesome youtube “Using LOAD CSV in the Real World” and decided to re-create the neo4j data using the same method.
I’ve cloned her git repo on this subject and have been working this example on the community version of neo4j on my Mac.
I’m stepping thru the load.cql file one command at a time pasting each command into the command window.
Things are going pretty good- I’ve got a bunch of nodes created. To deal with
null values for sub_products and sub_issues in the master file, I created
two other csv files: sub_issues.csv and sub_products.csv as described in the video.
But when I try reading ether these files, I get "(no changes, no rows)”
somehow I get the impression there is something wrong…
below is the actual command sequence I used for the incremental read.
// Load.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:///Volumes/microSD/neo4j-complaints/sub_issue.csv' AS line
WITH line
WHERE line.`Sub-issue` <> '' AND
line.`Sub-issue` IS NOT NULL
MATCH (complaint:Complaint { id: TOINT(line.`Complaint ID`) })
MATCH (complaint)-[:WITH]->(issue:Issue)
MERGE (subIssue:SubIssue { name: UPPER(line.`Sub-issue`) })
MERGE (subIssue)-[:IN_CATEGORY]->(issue)
CREATE (complaint)-[:WITH]->(subIssue)
;
Strip out some of the later statements and do a "RETURN identifier1, identifier2" etc. to see what the engine is doing.

Neo4j Java VM Tuning (v2.3 Community)

From what I can tell I'm having an issue with my Neo4j v2.3 Community Java VM adding items to the Old Gen Heap and never being able to Garbage Collecting them.
Here is a detailed outline of the situation.
I have a PHP file which calls the Dropbox Delta API and writes out the file structure to my Neo4j Database. Each call to Delta returns a 2000 Item data sets of which I pull out the information I need, the following is an example of what that query looks like with just one item, usually I send in full batches of 2000 items as it gave me the best results.
***Following is an example Query***
MERGE (c:Cloud { type:'Dropbox', id_user:'15', id_account:''})
WITH c
UNWIND [
{ parent_shared_folder_id:488417928, rev:'15e1d1caa88',.......}
]
AS items MERGE (i:Item { id:items.path, id_account:'', id_user:'15', type:'Dropbox' })
ON Create SET i = { id:items.path, id_account:'', id_user:'15', is_dir:items.is_dir, name:items.name, description:items.description, size:items.size, created_at:items.created_at, modified:items.modified, processed:1446769779, type:'Dropbox'}
ON Match SET i+= { id:items.path, id_account:'', id_user:'15', is_dir:items.is_dir, name:items.name, description:items.description, size:items.size, created_at:items.created_at, modified:items.modified, processed:1446769779, type:'Dropbox'}
MERGE (p:Item {id_user:'15', id:items.parentPath, id_account:'', type:'Dropbox'})
MERGE (p)-[:Contains]->(i)
MERGE (c)-[:Owns]->(i)
***The query is sent via Everyman****
static function makeQuery($client, $qry) {
return new Everyman\Neo4j\Cypher\Query($client, $qry);
}
This works fine and generally from start to finish takes 8-10 seconds to run.
The Dropbox account I am accessing contains around 35000 items, and takes around 18 runs of my PHP to populate my Neo4j Database with the folder/file structure of the dropbox account.
With every run of this PHP, around 50mb of items are added to the Neo4j JVM Old Gen heap, 30mb of that is not removed by GC.
The end result is obviously the VM runs out of memory and gets stuck in a constant state of GC throttling.
I have tried a range of Neo4j VM settings, as well as an update from Neo4j v2.2.5 to v2.3, which actually has appeared to make the problem worse.
My current settings are as follows,
-server
-Xms4096m
-Xmx4096m
-XX:NewSize=3072m
-XX:MaxNewSize=3072m
-XX:SurvivorRatio=1
I am testing on a windows 10 PC with 8GB of ram and an i5 2.5GHz quad core. Java 1.8.0_60
Any info on how to solve this issue would be greatly appreciated.
Cheers, Jack.
Reduce the new size to 1024m
change your settings to:
-server
-Xms4096m
-Xmx4096m
-XX:NewSize=1024m
It is most likely that the size of your tx grows too large.
I recommend sending each of the parents in separately, so instead of the UNWIND sent one statement each.
Make sure to use the new transactional http endpoint, I recommend to go wit neoclient instead of Neo4jPHP
You should also use parameters instead of literal values!!!
And don't repeeat user-id and type etc. properties on every item.
Are you sure you want to connect everything to c not just the root of the directory structure? I would do the latter.
MERGE (c:Cloud:Dropbox { id_user:{userId}})
MERGE (p:Item:Dropbox {id:{parentPath}})
// owning the parent should be good enough
MERGE (c)-[:Owns]->(p)
WITH c
UNWIND {items} as item
MERGE (i:Item:Dropbox { id:item.path})
ON Create SET i += { is_dir:item.is_dir, name:item.name, created_at:item.created_at }
SET i += { description:item.description, size:item.size, modified:items.modified, processed:timestamp()}
MERGE (p)-[:Contains]->(i);
Make sure to use 2.3.0 for best MERGE performance for relationships.

Updating document in solr cloud 4.4

I have a solr cloud 4.4 set up with two shards.The shards are in two different machine. I have successfully indexed documents using CloudSolrServer of solr client 4.2.0. However while updating the documents some of the documents are getting updated while others are not getting updated and the exception is "org.apache.solr.common.SolrException: No active slice servicing hash code f02c79de in DocCollection(myindex)={" where "myindex" is the name of my index.
The code for updating the documents is as follows:
CloudSolrServer server = new CloudSolrServer("myip:9983");
server.setDefaultCollection("myindex");
List<SolrInputDocument> solrInputDocsList = new ArrayList<SolrInputDocument>();
SolrInputDocument solrInputDoc = new SolrInputDocument();
List<String> servicesList = new ArrayList<String>();
servicesList.add("hockey1");
servicesList.add("blogger1");
Map<String, List<String>> services = new HashMap<String, List<String>>();
services.put("set", servicesList);
solrInputDoc.addField("services",services);
solrInputDoc.addField("id",id);
solrInputDocsList.add(solrInputDoc);
server.add(solrInputDocsList);
server.commit();
My Question are:
1) When document are indexed how it is decides which document goes to which shard.
2) Why the update for some of the documents are failing.
For question 1, check this article
http://searchhub.org/2013/06/13/solr-cloud-document-routing/
The idea is that you can tell SolrCloud to put some documents together by generating the Id field following this pattern
shard_key!document_id
Everything that shares the shard_key will be placed in the same shard
For question 2, I had the same problem and I fixed it by cleaning the configuration Zookeeper had stored
Regards,
Randall Rosales

Resources