Virtuoso and Jena: Large RDF graphs loading issue

I have a 200GB RDF file in .nt format. I want to load it into Virtuoso (using Virtuoso Open-Source Edition 6.1.6). I used the Virtuoso bulk loader from the command line, but the load hangs after a couple of hours of running. Do you have any idea how I can load this large file into Virtuoso efficiently? I want to load it fast.
I also tried to query my 200GB RDF graph from Apache Jena. However, after running for 30 minutes it gives me a heap space related error. If you have any solution for the above problems, kindly let me know.

Jena TDB has a bulk loader which has been used on large inputs (hundreds of millions of triples).

What is the actual dataset you are loading? Is it actually just one file? We would recommend splitting into files of about 1GB max, and loading multiple files at a time with the bulk loader.
Have you done any performance tuning of the Virtuoso Server for the resources available on the machine in use, as detailed in the RDF Performance Tuning guide?
Please check with the status(''); command how many buffers are in use. If you run out of buffers during a load, the server will be swapping to disk continuously, which leads to the sort of apparent hangs you report.
Note you can also load the Virtuoso LD Meter functions to monitor the progress of the dataset loads.
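To illustrate the splitting advice above: N-Triples is a line-based format with one triple per line, so a file can be cut at any line boundary without breaking the syntax. Below is a minimal Python sketch under that assumption. The function name split_ntriples, the part-NNNNN.nt naming scheme, and the 1 GB default are made up for illustration and are not part of Virtuoso or Jena.

```python
import os
from typing import List


def split_ntriples(src_path: str, out_dir: str,
                   max_bytes: int = 1_000_000_000) -> List[str]:
    """Split a line-based N-Triples file into parts of at most max_bytes.

    A single line longer than max_bytes is written to a part of its own.
    Returns the paths of the parts in order. (Hypothetical helper.)
    """
    os.makedirs(out_dir, exist_ok=True)
    parts: List[str] = []
    out = None
    written = 0
    try:
        with open(src_path, "rb") as src:
            for line in src:
                # Open a new part if none is open yet, or if this line
                # would push a non-empty part over the size limit.
                if out is None or (written > 0 and written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    part_path = os.path.join(out_dir, f"part-{len(parts):05d}.nt")
                    out = open(part_path, "wb")
                    parts.append(part_path)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

The resulting directory can then be handed to the bulk loader. If the data contains blank nodes, check how your loader treats blank node labels across separate files before relying on a split like this.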

Related

dask scheduler OOM opening a large number of avro files

I'm trying to run a pipeline via dask on a cluster on GCP. The pipeline loads a lot of avro files from cloud storage (~5300 files of around 300MB each) like this:
import dask.bag as db

bag = db.read_avro(
    'gcs://mybucket/myfiles-*.avro',
    blocksize=5000000
)
It then applies some transformations and saves the data back to cloud storage (as parquet files).
I've tested this pipeline with a fraction of the avro files and it works perfectly, but when I tell it to ingest all the files, the scheduler process sits at 100% CPU for a long time and at some point it runs out of memory (I have tried scaling my master node running the scheduler up to 64GB of RAM but that still does not suffice), while the workers are idling. I assume that the problem is that it has to create an excessive amount of tasks that are all held in RAM before being distributed to the workers.
Is this some sort of antipattern that I'm using when trying to open a very large number of files? If so, is there perhaps a built-in way to better cope with this or would I have to split the avro files manually?
Avro with Dask at scale is not particularly well-trodden territory. There is no theoretical reason it should not work. You could inspect the contents of the graph to see if things are getting serialised there that are large, or if simply a massive number of tasks are being generated. If the former, it may be solvable, and you could raise an issue.
As you say, you may be able to keep the load on the scheduler down by processing sub-batches out of the total set of files at a time and waiting for completion.
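The sub-batch idea can be sketched roughly as follows. This is only an outline under assumptions: batched and run_in_batches are hypothetical helper names, and process_batch stands for whatever function builds and runs the pipeline for one list of files and blocks until it has finished.

```python
from typing import Callable, Iterator, List, Sequence


def batched(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of paths with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


def run_in_batches(paths: Sequence[str], batch_size: int,
                   process_batch: Callable[[List[str]], None]) -> int:
    """Run process_batch on one sub-batch at a time.

    process_batch must block until its batch is complete, so the
    scheduler only ever holds the task graph of a single batch.
    Returns the number of files processed.
    """
    done = 0
    for batch in batched(paths, batch_size):
        process_batch(batch)
        done += len(batch)
    return done
```

With dask, process_batch could pass the list of paths to db.read_avro (which accepts a list as well as a glob string), apply the transformations, and write the parquet output before returning. Obtaining the full list of paths from the bucket is not shown here, and a good batch size would have to be found by experiment.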

Debugging slow reads from BigQuery on Google Cloud Dataflow

Background:
We have a really simple pipeline which reads some data from BigQuery (usually ~300MB), filters/transforms it and puts it back to BigQuery. In 99% of cases this pipeline finishes in 7-10 minutes and is then restarted to process a new batch.
Problem:
Recently, the job has started to take >3h once in a while, maybe 2 times in a month out of 2000 runs. When I look at the logs, I can't see any errors and in fact it's only the first step (read from BigQuery) that is taking so long.
Does anyone have a suggestion on how to approach debugging of such cases? Especially since it's really the read from BQ and not any of our transformation code. We are using Apache Beam SDK for Python 0.6.0 (maybe that's the reason!?)
Is it maybe possible to define a timeout for the job?
This is an issue on either the Dataflow side or the BigQuery side, depending on how one looks at it. When splitting the data for parallel processing, Dataflow relies on an estimate of the data size. The long runtime happens when BigQuery sporadically gives a severe under-estimate of the query result size. As a consequence, Dataflow severely over-splits the data, and the runtime becomes bottlenecked by the overhead of reading lots and lots of tiny file chunks exported by BigQuery.
On one hand, this is the first time I've seen BigQuery produce such dramatically incorrect query result size estimates. However, as size estimates are inherently best-effort and can in general be arbitrarily off, Dataflow should control for that and prevent such oversplitting. We'll investigate and fix this.
The only workaround that comes to mind meanwhile is to use the Java SDK: it uses quite different code for reading from BigQuery that, as far as I recall, does not rely on query size estimates.

Dask array to HDF5 parallel write fails with multiprocessing scheduler

Dask is a well-documented, scalable library for parallel processing, and its graph-based workflows are extremely useful for writing applications with inherent parallelism. Writing to HDF5 files in parallel, however, is rather difficult, especially with the multiprocessing scheduler. The following code works fine with the default multi-threaded scheduler:
import dask.array as da

x = da.arange(25000, chunks=(1000,))
da.to_hdf5('hdfstore.h5', '/store', x)
But if you set the multiprocessing scheduler globally:
import dask.multiprocessing
dask.set_options(get=dask.multiprocessing.get)
and run the code again, it fails with:
TypeError: can't pickle _thread.lock objects
The multithreaded scheduler works, but it is too slow when reading from a single large CSV file and converting it to an HDF5 file. The multiprocessing scheduler is fast and keeps all CPUs at full load, but the HDF5 write fails with the error above (HDF5 files support simultaneous write access with the h5py MPI driver, I think). If you directly do
x.compute()
everything is fine, but that loads the entire data into memory, which does not work well with large arrays and files. Has anybody come across such scenarios? Please share any suggestions.
Dask version '0.13.0' on a conda virtual env
I think the problem with this code is that, with the multiprocessing scheduler, it writes different chunks of data to the same HDF5 file simultaneously.
As far as I know, the HDF5 format supports SWMR (single writer, multiple readers). So while one Python process holds the right to write a chunk, a locking mechanism prevents other processes from writing at the same time.
If you want simultaneous writes, maybe this could help.
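One way to live with the single-writer restriction is to compute chunks in parallel but let only one process write them. The sketch below shows that pattern with the standard library only. It is not what da.to_hdf5 does internally; chunk_spans and compute_parallel_write_serial are hypothetical names, and compute and write are callables you would supply yourself.

```python
from concurrent.futures import Executor
from typing import Any, Callable, List, Sequence, Tuple

Span = Tuple[int, int]


def chunk_spans(length: int, chunk: int) -> List[Span]:
    """Return (start, stop) index pairs covering range(length)."""
    return [(start, min(start + chunk, length))
            for start in range(0, length, chunk)]


def compute_parallel_write_serial(
    spans: Sequence[Span],
    compute: Callable[[Span], Any],
    write: Callable[[Span, Any], None],
    executor: Executor,
) -> int:
    """Compute blocks in the executor, write them from the calling process.

    executor.map yields results in input order, so write is called
    sequentially and only ever from one process. Finished blocks that
    have not been written yet are held in memory.
    Returns the number of blocks written.
    """
    written = 0
    for span, block in zip(spans, executor.map(compute, spans)):
        write(span, block)
        written += 1
    return written
```

With a ProcessPoolExecutor, compute has to be a picklable module-level function, and write could assign each block into an open h5py dataset, for example dset[start:stop] = block. That part is an assumption about how you would wire it up and is not tested here.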

Neo4J Memory tuning having little effect

I am currently running some simple Cypher queries (count etc.) on a large dataset (>10G) and am having some issues with tuning Neo4j.
The machine running the queries has 4TB of RAM and 160 cores, and is running Ubuntu 14.04 with neo4j version 2.3. Originally I left all the settings at their defaults, as it is stated that free memory will be dynamically allocated as required. However, as the queries are taking several minutes to complete, I assumed this was not the case. As such I have set various combinations of the following parameters within neo4j-wrapper.conf:
wrapper.java.initmemory=1200000
wrapper.java.maxmemory=1200000
dbms.memory.heap.initial_size=1200000
dbms.memory.heap.max_size=1200000
dbms.jvm.additional=-XX:NewRatio=1
and the following within neo4j.properties:
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=50G
neostore.relationshipstore.db.mapped_memory=50G
neostore.propertystore.db.mapped_memory=50G
neostore.propertystore.db.strings.mapped_memory=50G
neostore.propertystore.db.arrays.mapped_memory=1G
following every guide/Stack Overflow post I could find on the topic, but I seem to have exhausted the available material with little effect.
I am running queries through the shell using the command neo4j-shell -c < "queries/$1.cypher", but have also tried explicitly passing the conf files with -config $NEO4J_HOME/conf/neo4j-wrapper.conf (restarting the server every time I make a change).
I imagine that I have missed something silly which is causing the issue, as there are many reports of neo4j working well with data of this size, but cannot think what it could be. As such any help would be greatly appreciated.
Type :SCHEMA in your neo4j browser to show if you have indexes.
Share a couple of your queries.
In the neo4j.properties file, you need to set the dbms.pagecache.memory setting to about 1.5x the size of your database files. In your example (a database of about 10G), you can set it to 15g.

How to configure Neo4j to run in a minimal memory environment?

For demo purposes, I am running Neo4j in a low-memory environment: a laptop with 4GB of RAM, of which 1644MB is used for video memory, leaving only 2452MB available. It's also running SQL Server, our WCF services, and our clients, so there's little memory left for Neo4j.
I'm running LOAD CSV Cypher scripts via REST from a C# service. There are more than 20 scripts, and they work well in a server environment. I've written code to paginate, so that they run in smaller batches. I've reduced the batch size to a very low value (25 CSV rows), and a given script may do 300 batches, but I continue to get "Java heap space" errors at some point.
I've tried configuring Neo4j with a relatively large heap space ( 640MB ) which is all the available RAM size plus setting the cache_type to none, and it gets much further before I get the java heap space error. What I don't understand is in that case, why does it grow that much? Also until I restart the neo4j service, I get these java heap space errors quickly. The batch size doesn't seem to impact how much memory is used appreciably.
However, when I run the application with these settings, query performance becomes very slow because of the cache settings.
I am running this on a Windows 7 laptop with 4GB of RAM, using Neo4j 2.2.1 Community Edition.
Thoughts?
Perhaps you can share your LOAD CSV statement and the other queries you run.
I think you just ran into this:
http://markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
So PROFILE or EXPLAIN your queries and rewrite them so they do not build up that much intermediate state. We can help if you share your statements.
And you should use PERIODIC COMMIT 100.
Something like:
heap=512M
dbms.pagecache.memory=200M
keep_logical_logs=false
cache_type=none
http://console.neo4j.org runs neo4j in memory putting up to 50 instances in a single gigabyte of memory. So it should be doable.
