Issue Importing the Paradise Papers Dataset into Neo4j

Hi all,
I am having an issue with importing the Paradise Papers dataset into a Neo4j (3.3.2) database.
It seems that the data is imported correctly into the database, as reported by neo4j-admin import.
...
IMPORT DONE in 1m 4s 889ms.
Imported:
867931 nodes
1657838 relationships
17838925 properties
Peak memory usage: 488.28 MB
...
However, after importing the data, the database seems to be empty, as reported by the Cypher queries MATCH (n) RETURN count(n); and CALL apoc.meta.graph();
...
count(n)
0
nodes, relationships
[], []
...
The following link points to a script that should reproduce my issue. It is a Bash script for OS X/BSD (I think the -E switch for sed does not exist on Linux). Additionally, the script requires Docker to be installed and running on the system.
https://github.com/HelgeCPH/cypher_kernel/blob/master/example/import_data.sh
To run the script quickly:
wget https://raw.githubusercontent.com/HelgeCPH/cypher_kernel/master/example/import_data.sh
chmod u+x import_data.sh
./import_data.sh
I cannot see what I am doing wrong. Do I have to point to the database explicitly when running cypher-shell?
Checking on the container, the database files exist (ls -ltrh data/databases/graph.db) and their timestamps correspond to the time of the import.
Thanks in advance for your help!

There were multiple errors in your script:
Nodes were not loaded because the :ID column is not set in the CSV headers. That's why I added this part:
for file in import/csv_paradise_papers/*.nodes.*.csv
do
    sed -i -E '1s/node_id/node_id:ID/' "$file"
done
Node labels were also not set. It's possible to set them directly on the command line, like this: --nodes:MyLabel
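For illustration, a minimal sketch of what a corrected import invocation could look like (labels, the relationship type, and file names are placeholders, not necessarily the exact ones from the Paradise Papers dump):
# example only: adjust labels, relationship type, and file names to the actual CSVs
bin/neo4j-admin import \
    --nodes:Officer=import/csv_paradise_papers/paradise_papers.nodes.officer.csv \
    --nodes:Entity=import/csv_paradise_papers/paradise_papers.nodes.entity.csv \
    --relationships:CONNECTED_TO=import/csv_paradise_papers/paradise_papers.edges.csv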
If you do a query on Neo4j while the server is restarting, you will probably receive an error because the server is not yet ready. That's why I added a sleep 5 at the end.
A better approach would be to wait until you get a response from the server, with something like this:
until curl --output /dev/null --silent --head --fail http://localhost:7474; do
    printf '.'
    sleep 1
done
Last point: I don't know why, but if you restart neo4j inside the container, you will not see the imported data. If you restart the container itself, however, it works.
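For example, assuming the container is named neo4j (adjust to your container name):
# restart the whole container instead of restarting neo4j inside it
docker restart neo4j
# follow the logs until the server reports it has started
docker logs -f neo4j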

Related

Running CQL file in Neo4j

I have a CQL file called Novis.cql. It's somewhere random on my hard drive, but I want to run it in Neo4j to create my graph (it contains 500+ lines of code).
Where do I have to place it? And what command do I have to run nowadays to get it working? I've read and searched for answers, but some of the commands, like neo4j-shell, don't seem to work any longer...
Any help would be much appreciated!
The cypher-shell tool has been available for a while (starting with version 3.0, if not earlier), and you can use it to execute a Cypher query from a file that can be anywhere in your file system.
For example (on a linux/unix system), a command line like this will work (if you are in the neo4j home directory):
cat /my/full/path/my_code.cql | bin/cypher-shell -u neo4j -p secret
In neo4j 4.0 a new -f option was added to make it simpler:
bin/cypher-shell -u neo4j -p secret -f /my/full/path/my_code.cql
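If the server is not listening on the default address, cypher-shell also accepts an -a/--address option; a sketch with placeholder host and credentials:
bin/cypher-shell -a bolt://my-neo4j-host:7687 -u neo4j -p secret -f /my/full/path/my_code.cql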

InfluxDB: Move only one database of many from one server instance to another

I have an InfluxDB server instance containing several databases, like sensors, network, telegraf and so on.
Together these databases consume several dozen GB, and I want to offload only the sensors database to another, more powerful machine.
The simplest case would be that I create a new InfluxDB server instance on that other machine, and just move (rsync) the influxdb/data/sensors folder to the other machine, and delete it from the original one.
While I haven't tested it, I assume that this does not work that easily; there is a data/_internal directory, then there's the meta/meta.db file as well as the wal/* directory, which will probably require everything to be left "as-is" in order for the server instance to boot without error.
Since I'm talking about dozens of GB per database, ideally I would just like to mount a new SSD, copy the files/directories onto it, then mount that SSD on the other machine and use it directly as the new data source without further copying.
Basically, I wish I could do this as easily as moving rrdtool's .rrd files from one machine to another.
Is this possible? If not, what are my options?
Edit 2022: This is a solution that works for InfluxDB 1.x; the commands shown here may not be directly applicable to 2.x. Here is a link to the 2.x backup/restore documentation: https://docs.influxdata.com/influxdb/v2.2/backup-restore/
The InfluxDB 2.2 influx backup command is not compatible with versions of InfluxDB prior to 2.0.0.
I resorted to using influxd backup / influxd restore as Yuri Lachin pointed out.
While it does have the drawback of first needing to save the data on disk and then read it back in from there, it seems to be the most flexible approach.
Rsyncing 50 GB takes a certain amount of time, and the database would have to be offline for the whole duration; that is not required with backup / restore, so no data is lost. It also allows migrating data that used to live on a single InfluxDB instance to several different InfluxDB servers without having to worry about the metadata database.
The backup / restore can be done in steps: first back up all the existing data of the database and restore it into the new server instance, then export the newest data that didn't make it into the first backup and restore that into the new database as well.
Step 1:
On the machine containing the new, empty InfluxDB server instance, backup the data from the remote, old InfluxDB instance:
influxd backup \
-portable \
-host 192.168.11.10:8088 \
-database sensors \
/var/lib/influxdb/export-sensors-01
Afterwards import this data into the new server instance:
influxd restore \
-portable \
/var/lib/influxdb/export-sensors-01
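At this point I would sanity-check that the restore actually created the database (assuming default local connection settings):
influx -execute 'SHOW DATABASES'
influx -database=sensors -execute 'SHOW MEASUREMENTS'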
Step 2:
Now take the time to adjust the IP address or domain name that the InfluxDB clients currently connect to, and make them point to the new InfluxDB server; restart the clients if necessary.
Step 3:
Between the time the backup finished and the time you restarted the clients with the new IP address, new data was still being written to the old database, so we need to sync that data over.
Again, on the new server, pull a backup from the old one, but specify the time range of the missing data and a different target directory:
influxd backup \
-portable \
-host 192.168.11.10:8088 \
-database sensors \
-start 2019-06-22T19:30:00Z \
-end 2019-06-24T00:00:00Z \
/var/lib/influxdb/export-sensors-02
Apparently it is important to specify -end as well: one test I did without the -end argument started to back up the entire database again. I just Ctrl-D'd out of it, deleted /var/lib/influxdb/export-sensors-02, and started it again with the -end argument set.
The -start argument can overlap by a couple of minutes with data that has already been restored, since during the restore of this second backup the duplicate entries are either ignored or simply overwrite the identical values that already exist.
For example, if you start the main backup at 4 pm and it finishes at 6 pm, the second backup can use a -start argument of 5:55 pm and an -end argument a couple of days in the future. That is no problem, because as soon as you switch the clients over to the new IP address, no more data will be written to the old database. The -since argument would probably have been the better choice, but I was experimenting a bit with time ranges, so I stuck with -start and -end.
In order to insert the missing data that you just backed up into /var/lib/influxdb/export-sensors-02, you need to do a bit more work, since you can't restore into an already existing database. If you try, nothing is damaged; a warning message is shown and the restore is aborted.
So we will need to restore the data into a new, temporary database:
influxd restore \
-portable \
-database sensors \
-newdb sensors_tmp_backup \
/var/lib/influxdb/export-sensors-02
Then copy the data into the sensors database:
influx \
-database=sensors_tmp_backup \
-execute 'SELECT * INTO sensors..:MEASUREMENT FROM /.*/ GROUP BY *'
And delete the temporary database:
influx \
-database=sensors_tmp_backup \
-execute 'DROP DATABASE sensors_tmp_backup'
If all is OK, delete the backup directories:
rm -rf /var/lib/influxdb/export-sensors-01
rm -rf /var/lib/influxdb/export-sensors-02
Before changing the addresses in Step 2, you can test Step 3 a couple of times by letting the new DB catch up with the old, still-current one via a couple of additional backups. It's a good way to get acquainted with the procedure in Step 3.
If you're running InfluxDB in Docker, like I am doing, you can execute all the commands from the host. Step 3 would then look like this:
docker exec -w /var/lib/influxdb/ influxdb-1.7.6 influxd backup -portable -host 192.168.11.10:8088 -database sensors -start 2019-06-22T19:40:00Z -end 2019-06-24T00:00:00Z /var/lib/influxdb/export-sensors-02
docker exec -w /var/lib/influxdb/ influxdb-1.7.6 influxd restore -portable -database sensors -newdb sensors_tmp_back /var/lib/influxdb/export-sensors-02
docker exec -w /var/lib/influxdb/ influxdb-1.7.6 influx -database=sensors_tmp_back -execute 'SELECT * INTO sensors..:MEASUREMENT FROM /.*/ GROUP BY *'
docker exec -w /var/lib/influxdb/ influxdb-1.7.6 influx -database=sensors_tmp_back -execute 'DROP DATABASE sensors_tmp_back'
docker exec -w /var/lib/influxdb/ influxdb-1.7.6 rm -rf /var/lib/influxdb/export-sensors-01
docker exec -w /var/lib/influxdb/ influxdb-1.7.6 rm -rf /var/lib/influxdb/export-sensors-02
If you are having problems accessing the remote InfluxDB server, keep in mind that the RPC port 8088 is usually bound to localhost for security reasons. You may need to bind it to 0.0.0.0 first, probably by setting the environment variable INFLUXDB_BIND_ADDRESS on the remote instance to 0.0.0.0:8088 (as specified in the documentation) and then restarting the server.
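For example, if the old instance runs in Docker, the port can be exposed with something like this (container name, image tag, and volume path are just placeholders):
docker run -d --name influxdb-1.7.6 \
    -p 8086:8086 -p 8088:8088 \
    -e INFLUXDB_BIND_ADDRESS=0.0.0.0:8088 \
    -v /var/lib/influxdb:/var/lib/influxdb \
    influxdb:1.7.6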
I'm not sure it is safe to rsync the influxdb/data/sensors directory files from a running InfluxDB instance. At the very least you should copy the files with rsync while influxd is running, then stop the influxd service and repeat the rsync to fetch the recently updated files.
Without copying influxdb/meta/meta.db to the new server, your new instance won't know about the existing old databases and measurements.
AFAIK, the procedure of manually copying files is not officially documented or recommended by InfluxData.
Using the official influxd backup / influxd restore commands is probably the safer approach. They were buggy 1-2 years ago when I tried them, but are likely to work now. You can run the backup on the new server against the remote old instance and restore the backup locally.
You may try, as you mentioned in your question, copying the influxdb/data/sensors directory to the new machine.
The _internal database maintains runtime statistics, so you can ignore it if you are not looking into that data.
I am not sure where InfluxDB keeps its metadata, so be cautious.
The wal/* directory is nothing but the write-ahead log, used to avoid data loss. I assume you have some downtime for this activity. If the most recent data is already present in the sensors DB before you do the copying, there is no chance of data loss from the WAL.

neo4j-shell example of running a Cypher script

I need to run a Cypher query against a Neo4J database, from a command line (for batch scheduling purposes).
When I run this:
./neo4j-shell -file /usr/share/neo4j/scripts/query.cypher -path /usr/share/neo4j/neo4j-community-3.1.1/data/databases/graph.db
I get this error:
ERROR (-v for expanded information):
Error starting org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory, /usr/share/neo4j/neo4j-community-3.1.1/data/databases/graph.db
There is a Neo4j instance running on that database (localhost:7474). I need the script to perform queries against it.
NOTE: this is a split of the original question, for the sake of tidiness.
To execute (one or more) Cypher statements from a file while the neo4j server is running, you can use the APOC procedure apoc.cypher.runFile(file or url).
Since you mention "batch scheduling", the Job management and periodic execution APOC procedures may be helpful. Those procedures could, in turn, execute calls to apoc.cypher.runFile.
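For example, a scheduled batch job could pipe a single CALL into cypher-shell while the server is running (a sketch; the file path and credentials are placeholders, and APOC may require apoc.import.file.enabled=true in neo4j.conf):
echo "CALL apoc.cypher.runFile('/usr/share/neo4j/scripts/query.cypher');" | bin/cypher-shell -u neo4j -p secret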
Okay, I just spun up a fresh instance of neo4j-community-3.1.1 today and ran into the exact same problem. Note that I had already created a database using the bulk import tool, so you might need to make a directory for the database (mkdir data/databases/graph.db) before using the shell.
I believe your problem might be that you have a Neo4j process running against the database you are trying to access.
For me, shutting down Neo4j, and then starting the shell with an explicit path worked:
cd /path/to/neo4j-community-3.1.1/
bin/neo4j stop ## assuming it is already running (may need a port specifier)
bin/neo4j-shell -path data/databases/graph.db
For some reason I thought you could have both the shell and the server running, but apparently that is not the case. Hopefully someone will correct me if I am wrong.

Docker. How to resume downloading image when interrupted?

How can I resume a pull after being disconnected? The pull process always starts from the beginning every time I run docker pull some-image again after a disconnect. My connection is so unstable that even downloading just a 100 MB image takes very long and fails almost every time, so it is almost impossible for me to pull a bigger image. How can I resume the pull process?
Update:
The pull process will now automatically resume based on which layers have already been downloaded. This was implemented with https://github.com/moby/moby/pull/18353.
Old:
There is no resume feature yet. However there are discussions around this feature being implemented with docker's download manager.
Docker's released code isn't as up to date as the moby development repository on GitHub. People have been having issues related to this for several years. I tried to manually apply several patches which aren't upstream yet, and none of them worked decently.
The GitHub repository for moby (Docker's development repo) has a script called download-frozen-image-v2.sh. The script uses bash, curl, and other command-line tools such as a JSON interpreter (jq). It retrieves a Docker registry token and then downloads all of the image layers into a local directory. You can then use docker load to import the result into your local Docker installation.
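Usage is roughly as follows (a sketch; check the script's own help output for the exact arguments):
# download all layers of ubuntu:latest into ./ubuntu-image
./download-frozen-image-v2.sh ./ubuntu-image ubuntu:latest
# pack the directory into a tar stream and load it into the local Docker daemon
tar -cC ./ubuntu-image . | docker load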
It does not handle resuming well, though. The script contains a comment noting that curl -C isn't working. I tracked down and fixed this problem. I made a modification which first retrieves a ".headers" file (this has always returned a 302 redirect while I've been monitoring it) and then fetches the final URL with curl (plus resume support) into the layer tar file. The calling function, which retrieves a valid token, also has to loop, because the token unfortunately only lasts about 30 minutes.
It loops this process until it receives a 416, which means no resume is possible because the requested ranges have already been fulfilled. It also verifies the size against a curl header retrieval. I have been able to retrieve all the images I needed using this modified script. Docker has many more layers of code related to retrieval, and remote-controlled processes (the Docker client) that make it more difficult to control, and the maintainers viewed this issue as only affecting some people on bad connections.
I hope this script can help you as much as it has helped me:
Changes:
The fetch_blob function uses a temporary file for its first connection and extracts the 30x HTTP redirect from it. It then attempts a header retrieval on the final URL and checks whether the local copy already has the full file; otherwise, it begins a resuming curl operation. The calling function, which passes it a valid token, has a loop around the token retrieval and fetch_blob which ensures the full file is obtained.
The only other variation is a bandwidth-limit variable which can be set at the top of the script, or via a "BW:10" command-line parameter. I needed this to keep my connection usable for other operations.
Click here for the modified script.
In the future it would be nice if Docker's own client performed resuming properly. Increasing the token's validity period would also help tremendously.
Brief views of change code:
#loop until FULL_FILE is set in fetch_blob.. this is for bad/slow connections
while [ "$FULL_FILE" != "1" ]; do
    local token="$(curl -fsSL "$authBase/token?service=$authService&scope=repository:$image:pull" | jq --raw-output '.token')"
    fetch_blob "$token" "$image" "$layerDigest" "$dir/$layerTar" --progress
    sleep 1
done
Another section from fetch_blob:
while :; do
    #if the file already exists.. we will be resuming..
    if [ -f "$targetFile" ]; then
        #getting current size of the file we are resuming
        CUR=`stat --printf="%s" "$targetFile"`
        #use curl to get headers to find the content-length of the full file
        LEN=`curl -I -fL "${curlArgs[@]}" "$blobRedirect" | grep content-length | cut -d" " -f2`
        #if we already have the entire file... let's stop curl from erroring with 416
        if [ "$CUR" == "${LEN//[!0-9]/}" ]; then
            FULL_FILE=1
            break
        fi
    fi
    HTTP_CODE=`curl -w %{http_code} -C - --tr-encoding --compressed --progress-bar -fL "${curlArgs[@]}" "$blobRedirect" -o "$targetFile"`
    if [ "$HTTP_CODE" == "403" ]; then
        #token expired, so the server stopped allowing us to resume; return without setting FULL_FILE and it'll restart this func with a new token
        FULL_FILE=0
        break
    fi
    if [ "$HTTP_CODE" == "416" ]; then
        FULL_FILE=1
        break
    fi
    sleep 1
done
Try this
ps -ef | grep docker
Get the PIDs of all the docker pull commands and do a kill -9 on them. Once they are killed, re-issue the docker pull <image>:<tag> command.
This worked for me!
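The same thing as a rough one-liner (use with care, it kills every process whose command line matches):
pkill -9 -f 'docker pull'
docker pull <image>:<tag>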

How to change the memory config for the Neo4j-shell command tool of Cypher batch import

I am trying to use the neo4j-shell command-line tool to do a Cypher batch import. I followed the instructions described in "Import data into your neo4j database from the neo4j-shell command". Here is the command that I ran:
import-cypher -d "," -i c://temp//neo//import.csv -o c://temp//neo//out.csv start n=node:employee_idx(EmpID={emp_id}), m=node:permit_idx(PmtID={pmtid}) create n<-[:Assign{AssID:{assid}}]-m
With only 100000 records in the import.csv file, it ran perfectly. But with 200000 records in the import.csv file, I got the error: Error occurred in server thread; nested exception is: java.lang.OutOfMemoryError: Java heap space.
How to change the default memory config of this tool?
You need to set the environment variable JAVA_OPTS to appropriate values, e.g. on Linux it can be done using
JAVA_OPTS="-Xmx4G" bin/neo4j-shell
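If you prefer the setting to persist for the whole shell session (and to pre-allocate the initial heap as well), you can export the variable first; a sketch, pick a heap size that fits your machine:
export JAVA_OPTS="-Xms2G -Xmx4G"
bin/neo4j-shell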
