Importing bulk json data into neo4j

I am trying to load a JSON file of about 700k, but it fails with a heap memory out of space error.
My query is as follows:
WITH "file:///Users//arundhathi.d//Documents//Neo4j//default.graphdb//import//tjson.json" as url
call apoc.load.json(url) yield value as article return article
As with CSV, I tried to use USING PERIODIC COMMIT 1000, but it is not allowed when loading JSON.
How can I load bulk JSON data?

You can also convert JSON into CSV files using jq - an extremely fast command-line JSON processor. https://stedolan.github.io/jq/tutorial/
This is the recommended way according to: https://neo4j.com/blog/bulk-data-import-neo4j-3-0/
If you have many files, write a Python program or similar that iterates over the files calling something like the following (a fuller sketch follows below):
os.system("cat file{}.json | jq '. | [.entity1, .entity2, .entity3] | @csv' >> concatenatedCSV.csv".format(num))
or in Go:
exec.Command("cat file"+num+".json | jq '. [.entity1, .entity2, .entity3] | #csv' >> concatenatedCSV.csv")
I recently did this for about 700GB of JSON files. It takes some thought to get the CSV files into the right format, but if you follow the jq tutorial you'll pick up how to do it. Also check what the headers need to look like here: https://neo4j.com/docs/operations-manual/current/tools/import/
It took about a day to convert it all, but given the transaction overhead of using APOC, and the ability to re-import at any time once the files are in the right format, it is worth it in the long run.

apoc.load.json now supports a json-path as a second parameter.
To get the first 1000 JSON objects from the array in the file, try this:
WITH "file:///path_to_file.json" as url
CALL apoc.load.json(url, '[0:1000]') YIELD value AS article
RETURN article;
The [0:1000] syntax specifies a range of array indices, and the second number is exclusive (so, in this example, the last index in the range is 999).
The above should at least work in neo4j 3.1.3 (with apoc release 3.1.3.6). Note also that the Desktop versions of neo4j (installed via the Windows and OSX installers) have a new requirement concerning where to put plugins like apoc in order to import local files.
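If you need the whole file rather than just the first slice, the same range syntax can be used to page through the array from a small client script. Below is a minimal sketch with the official Neo4j Python driver; the URI, credentials, batch size and the CREATE clause are all placeholders, and the $-parameter syntax assumes Neo4j 3.1+:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
url = "file:///path_to_file.json"
batch_size = 1000

with driver.session() as session:
    start = 0
    while True:
        json_path = "[{}:{}]".format(start, start + batch_size)
        count = session.run(
            "CALL apoc.load.json($url, $path) YIELD value AS article "
            "CREATE (a:Article) SET a += article "
            "RETURN count(*) AS c",
            url=url, path=json_path).single()["c"]
        if count == 0:   # ran past the end of the array
            break
        start += batch_size
driver.close()
Each iteration is its own transaction of at most batch_size objects, which avoids building one huge transaction in the heap.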

Related

How to upload Polygons from GeoPandas to Snowflake?

I have a geometry column of a geodataframe populated with polygons and I need to upload these to Snowflake.
I have been exporting the geometry column of the geodataframe to file, and have tried both CSV and GeoJSON formats, but so far I either get an error or the staging table winds up empty.
Here's my code:
design_gdf['geometry'].to_csv('polygons.csv', index=False, header=False, sep='|', compression=None)
import sqlalchemy
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
engine = create_engine(
URL(<Snowflake Credentials Here>)
)
with engine.connect() as con:
con.execute("PUT file://<path to polygons.csv> #~ AUTO_COMPRESS=FALSE")
Then on Snowflake I run
create or replace table DB.SCHEMA.DESIGN_POLYGONS_STAGING (geometry GEOGRAPHY);
copy into DB.SCHEMA."DESIGN_POLYGONS_STAGING"
from @~/polygons.csv
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1 compression = None encoding = 'iso-8859-1');
Generates the following error:
"Number of columns in file (6) does not match that of the corresponding table (1), use file format option error_on_column_count_mismatch=false to ignore this error File '#~/polygons.csv.gz', line 3, character 1 Row 1 starts at line 2, column "DESIGN_POLYGONS_STAGING"[6] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client."
Can anyone identify what I'm doing wrong?
Inspired by @Simeon_Pilgrim's comment I went back to Snowflake's documentation. There I found an example of converting a string literal to a GEOGRAPHY.
https://docs.snowflake.com/en/sql-reference/functions/to_geography.html#examples
select to_geography('POINT(-122.35 37.55)');
My polygons looked more like strings describing polygons than actual GEOGRAPHYs, so I decided I needed to treat them as strings and then call TO_GEOGRAPHY() on them.
I quickly discovered that they needed to be explicitly enclosed in single quotes and copied into a VARCHAR column in the staging table. This was accomplished by modifying the CSV export code:
import csv
design_gdf['geometry'].to_csv(<path to polygons.csv>,
index=False, header=False, sep='|', compression=None, quoting=csv.QUOTE_ALL, quotechar="'")
The staging table now looks like:
create or replace table DB.SCHEMA."DESIGN_POLYGONS_STAGING" (geometry VARCHAR);
I ran into further problems copying into the staging table related to the presence of a polygons.csv.gz file I must have uploaded in a previous experiment. I deleted this file using:
remove @~/polygons.csv.gz
Finally, converting the staging table to GEOGRAPHY:
create or replace table DB.SCHEMA."DESIGN_GEOGRAPHY" (geometry GEOGRAPHY);
insert into DB.SCHEMA."DESIGN_GEOGRAPHY"
select to_geography(geometry)
from DB.SCHEMA."DESIGN_POLYGONS_STAGING"
and I wound up with a DESIGN_GEOGRAPHY table with a single column of GEOGRAPHYs in it. Success!!!
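For completeness, the remaining Snowflake-side steps can be driven from the same sqlalchemy engine used for the PUT earlier. This is only a sketch: the connection arguments are placeholders, and the statements simply mirror the ones shown in this answer (the COPY options for the quoted, pipe-delimited CSV are left as a comment):
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

engine = create_engine(URL(account="...", user="...", password="...",
                           database="DB", schema="SCHEMA"))

with engine.connect() as con:
    # clear out any stale upload, then stage the freshly exported CSV
    con.execute("remove @~/polygons.csv.gz")
    con.execute("PUT file://<path to polygons.csv> @~ AUTO_COMPRESS=FALSE")
    # ... COPY INTO DB.SCHEMA."DESIGN_POLYGONS_STAGING" as before,
    #     with FILE_FORMAT options that match the quoted, pipe-delimited CSV ...
    # convert the staged strings into GEOGRAPHYs
    con.execute('insert into DB.SCHEMA."DESIGN_GEOGRAPHY" '
                'select to_geography(geometry) '
                'from DB.SCHEMA."DESIGN_POLYGONS_STAGING"')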

Neo4j- convert Time String into Integer in query

How can I convert a time string into an integer in Neo4j? I have been trying for quite a long time, but there does not seem to be any clear solution. The only solution I found is to add a new property to my node while loading it from CSV, which I don't want to do.
I have Time in the following format:
"18:11:00"
and I want to do some subtraction on them.
I tried doing the following but to no avail:
match (st1:Stoptime)-[r:PRECEDES]->(st2:Stoptime)
return st1.arrival_time, toInt(st1.arrival_time)
limit 10
But I get the following output:
"18:11:00" null
You can install APOC procedures and do it using the function apoc.date.parse:
return apoc.date.parse('18:11:00','s','HH:mm:ss')
Running this example the output will be:
╒════════════════════════════════════════════╕
│"apoc.date.parse("18:11:00",'s','HH:mm:ss')"│
╞════════════════════════════════════════════╡
│65460 │
└────────────────────────────────────────────┘
The first parameter is the date/time to be parsed. The second parameter is the target time unit. In this case I have specified seconds (s). The third parameter indicates the date/time format of the first parameter.
Note: remember to install the APOC procedures release that matches the version of Neo4j you are using. Take a look at the version compatibility matrix.
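As a quick sanity check outside the database, the same conversion is easy to reproduce in plain Python (shown here only to confirm the 65460 figure):
h, m, s = map(int, "18:11:00".split(":"))
print(h * 3600 + m * 60 + s)   # 65460 seconds since midnight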

How to limit Jenkins API response to last n build IDs

http://xxx/api/xml?&tree=builds[number,description,result,id,actions[parameters[name,value]]]
The above API call returns all the build IDs. Is there a way to limit the results to get only the last 5 build IDs?
The tree query parameter allows you to explicitly specify and retrieve only the information you are looking for, by using an XPath-ish path expression. The value should be a list of property names to include, with sub-properties inside square braces. Try tree=jobs[name],views[name,jobs[name]] to see just a list of jobs (only giving the name) and views (giving the name and jobs they contain). Note: for array-type properties (such as jobs in this example), the name must be given in the original plural, not in the singular as the element would appear in XML. This will be more natural for e.g. json?tree=jobs[name] anyway: the JSON writer does not do plural-to-singular mangling because arrays are represented explicitly.
For array-type properties, a range specifier is supported. For example, tree=jobs[name]{0,10} would retrieve the name of the first 10 jobs. The range specifier has the following variants:
{M,N}: From the M-th element (inclusive) to the N-th element (exclusive).
{M,}: From the M-th element (inclusive) to the end.
{,N}: From the first element (inclusive) to the N-th element (exclusive). The same as {0,N}.
{N}: Just retrieve the N-th element. The same as {N,N+1}.
Another way to retrieve more data is to use the depth=N query parameter. This retrieves all the data up to the specified depth. Compare depth=0 and depth=1 and see what the difference is for yourself. Also note that data created by a smaller depth value is always a subset of the data created by a bigger depth value.
Because of the size of the data, the depth parameter should really be only used to explore what data Jenkins can return. Once you identify the data you want to retrieve, you can then come up with the tree parameter to exactly specify the data you need.
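Putting the tree parameter and the range specifier together, here is a small client-side sketch (Python with the requests package; the host, job path and credentials are placeholders). Builds are listed newest first in the API, so the first five entries in the tree are the five most recent builds:
import requests

JENKINS_JOB = "http://xxx"   # placeholder, e.g. http://jenkins.example.com/job/my-job
params = {"tree": "builds[number,description,result,id,"
                  "actions[parameters[name,value]]]{0,5}"}

resp = requests.get(JENKINS_JOB + "/api/json", params=params,
                    auth=("user", "api_token"))
resp.raise_for_status()
for build in resp.json()["builds"]:
    print(build["number"], build["result"])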
I'm on version 1.509.4, which doesn't support the range specifier.
Source: http://ci.citizensnpcs.co/api/
You can create an XML document containing the build numbers via XPath and parse it yourself by whatever means you like.
http://xxx/api/xml?xpath=//build/number&wrapper=meep
This creates XML that looks like:
<meep>
<number>n</number>
<number>n+1</number>
...
<number>m</number>
</meep>
It will be populated with the build numbers n through m that are currently in Jenkins for the job specified in the URL. You can substitute anything for the word "meep"; it becomes the wrapper element of the newly created XML document.
How are you collecting/manipulating the API XML output once you get it? There is a solution here for "How do I select the last N elements with XPath?". I tried some of those XPath manipulations, but I couldn't get them to work when playing with the URL in my browser; they might work if you are doing something else.
When I get the xml object, I happen to manipulate it via shell scripts.
#!/bin/sh
# NOTE: To get the url to work with curl, you need a valid jenkins user and api token
# Put all build numbers in a variable called build_ids
build_ids="$(curl -sL --user ${_jenkins_api_user}:${_jenkins_api_token} \
"${_jenkins_url}/job/${_job_name}/api/xml?xpath=//build/number&wrapper=meep" \
| sed -e 's/<[^>]*>/ /g' | sed -e 's/  */ /g')"
# Print the last 5 items with awk
echo "${build_ids}" | awk '{n = 5; for (--n; n >= 0; n--){ printf "%s\t",$(NF-n)} print ""}';
Once you have your xml object you can essentially parse it however you want.
NOTE: I am running Jenkins ver. 2.46.1
Looking at the documentation at the raw .../api/ endpoint (on Jenkins 2.60.3), it says:
For array-type properties, a range specifier is supported. For example, tree=jobs[name]{0,10} would retrieve the name of the first 10 jobs. The range specifier has the following variants:
{M,N}: From the M-th element (inclusive) to the N-th element (exclusive).
{M,}: From the M-th element (inclusive) to the end.
{,N}: From the first element (inclusive) to the N-th element (exclusive). The same as {0,N}.
{N}: Just retrieve the N-th element. The same as {N,N+1}.
For the OP's case, you'd append {,5} to the end of the URL to get the first 5 results (builds are listed newest first, so these are the 5 most recent):
http://xxx/api/xml?&tree=builds[number,description,result,id,actions[parameters[name,value]]]{,5}

store temp variables in neo4j

I have some Cypher queries that I execute against my Neo4j database. The queries are of this form:
MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id =5000 and r1.origin='iframe' and r1.job_id=5000 AND NOT (t.netloc =~ 'VERY_LONG_LIST')
RETURN count(r1) AS number_iframes;
If that is hard to follow, here is a much simpler query of the same shape:
MATCH (s:WORD)
WHERE NOT (s.text=~"badword1|badword2|badword3")
RETURN s
I am basically trying to match some words against a specific list.
The problem is that this list is very large. As you can see, my job_id is 5000 and I have more than 20000 jobs, so if my whitelist is 1 MB long I end up with very large queries; with 500 jobs I ended up with a 200 MB query file.
I was trying to execute these queries as transactions from py2neo, but that won't be feasible because the POST request would be very large and would time out. As a result, I thought of using
neo4j-shell -file <queries_file>
However, as you can see, the file is very large because of the long whitelist. So my question is: is there any way to store this "whitelist" in a variable in Neo4j using Cypher?
I wish there were something similar to this:
SAVE $whitelist="word1,word2,word3,word4,word5...."
MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id =5000 and r1.origin='iframe' and r1.job_id=5000 AND NOT (t.netloc =~ $whitelist)
RETURN count(r1) AS number_iframes;
What datatype is your netloc?
If you have an index on netloc you can also use t.netloc IN {list} where {list} is a parameter provided from the outside (see the sketch after these suggestions).
Such large regular expressions will not be fast.
What exactly is your regexp and netloc format like? Perhaps you can change that into a split + index-list lookup?
In general also for regexps you can provide an outside parameter.
You can also use "IN" + index for job_ids.
You can also run a separate job that tags the jobs within your whitelist with a label and use that label for additional filtering e.g. in the match already.
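Building on the suggestion to pass the list as an outside parameter, here is a minimal sketch with the official Neo4j Python driver (connection details and the whitelist contents are placeholders; the $param syntax assumes Neo4j 3.1+, use {param} on older versions):
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
whitelist = ["example.com", "example.org"]   # load this once from your whitelist file

query = (
    "MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL) "
    "WHERE j.job_id = $job_id AND r1.origin = 'iframe' AND r1.job_id = $job_id "
    "AND NOT t.netloc IN $whitelist "
    "RETURN count(r1) AS number_iframes")

with driver.session() as session:
    record = session.run(query, job_id=5000, whitelist=whitelist).single()
    print(record["number_iframes"])
driver.close()
The query text stays a few hundred bytes regardless of how long the whitelist is, and with an index on :URL(netloc) the IN lookup avoids the regular-expression scan.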
Why do you have to check this twice? Isn't it enough that the job has id=5000?
j.job_id =5000 and r1.job_id=5000

How can I prevent SQL injections during CSV uploads?

I've just started learning about Rails security, and I'm wondering how I can avoid security issues while allowing users to upload CSV files into our database. We're using Postgres' "copy from stdin" functionality to upload the data from the CSV into a temp table, which is then used for upserts into another table. This is the basic code (thanks to this post):
conn = ActiveRecord::Base.connection_pool.checkout
raw = conn.raw_connection
raw.exec("COPY temp_table (col1, col2) FROM STDIN DELIMITER '|'")
# read column values from the CSV line by line in the following format:
# attributes = {column_1: 'column 1 data', column_2: 'column 2 data'}
# line = "#{attributes.values.join('|')}\n"
raw.put_copy_data line
# wrap up copy process & insert into & update primary table
I am wondering what I can or should do to sanitize the column values. We're using Rails 3.2 and Postgres 9.2.
No action is required; COPY never interprets the values as SQL syntax. Malformed CSV will produce an error due to bad quoting or an incorrect column count. If you're sending your own data line by line, you should probably exclude a line containing a single \. followed by a newline (the end-of-data marker), but otherwise it's rather safe.
PostgreSQL doesn't sanitize the data in any way, it just handles it safely. So if you accept a string like ');DROP TABLE customer;-- in your CSV, it's quite safe in COPY. However, if your application reads that value back out of the database, assumes that "because it came from the database, not the user, it's safe," and interpolates it into an SQL string, you're still just as stuffed.
Similarly, incorrect use of PL/PgSQL functions where EXECUTE is used with unsafe string concatenation will create problems. You must use format with the %I or %L specifiers, use quote_literal / quote_ident, or (for literals) use EXECUTE ... USING.
This is not just true of COPY, it's the same if you do an INSERT of the manipulated data then use it unsafely after reading it back from the DB.
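To make that last point concrete, here is a minimal sketch of the read-back side, shown in Python with psycopg2 purely for illustration (the table and column names are made up); the same distinction applies to ActiveRecord or any other client:
import psycopg2

conn = psycopg2.connect("dbname=mydb")          # placeholder connection string
cur = conn.cursor()

# a value that was loaded via COPY and is now being read back
nasty = "');DROP TABLE customer;--"

# Unsafe: building SQL text out of the data re-opens the injection hole
#   cur.execute("INSERT INTO audit_log (note) VALUES ('" + nasty + "')")

# Safe: the value travels as a bound parameter, never as part of the SQL text
cur.execute("INSERT INTO audit_log (note) VALUES (%s)", (nasty,))
conn.commit()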
