Neo4j staged batch import

I want to import existing entities and their relationships from a MySQL database into a new Neo4j database. There are a couple of things I still do not quite understand:
Based on the description of the batch importer, it appears that I need both an entity file and a relationship file. Can I execute an import without one or the other file type?
Can I execute a series of batch imports, using different files for different entities?

Are you using the batch importer from the Neo4j website or the one by jexp/Michael Hunger?
If it's the jexp batch-import, you could execute just the entity/nodes file (resulting in a bunch of nodes and no edges) or just the rels file (resulting in an empty graph, since there are no nodes to connect). Or you could import the nodes, then import the rels, either in the same import or in a series of imports.
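For illustration, a minimal pair of input files could look like this (a sketch using the :ID/:START_ID header convention of the newer batch import tool; the jexp importer's exact header syntax depends on the version):

persons.csv
id:ID,name,:LABEL
1,Alice,Person
2,Bob,Person

knows.csv
:START_ID,:END_ID,:TYPE
1,2,KNOWS

Importing only persons.csv leaves you with two unconnected nodes; importing only knows.csv into an empty database gives the importer nothing to connect.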

Related

py2neo (Neo4j) bulk operation

I have created a nice graph using Neo4j Desktop, using a local CSV file that I added to the project. Now I'm trying to use Python to do that automatically. For that I'm trying to use the methods from https://py2neo.org/2021.1/bulk/index.html, like create/merge nodes/relationships.
I have 2 questions:
1) Do I need to use a create method (create nodes, for example) and only after that a merge method (merge nodes in this example), or can I use merge nodes from the beginning? I have tried to use merge only and I got some weird results when using a large sample size.
2) After creating the nodes and relationships, how can I change the visualization of the nodes (put some value in the node)?
*If there is another way to use Python to create a graph from a big CSV file I would like to hear it. Thanks!
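For what it's worth, py2neo's bulk merge_nodes is essentially a batched Cypher MERGE keyed on the merge_key tuple you pass, so merging from the beginning is fine as long as every batch carries the key and the key is unique. A rough Cypher sketch of what a merge keyed on ("Person", "name") corresponds to (the Person label and name/age properties are placeholders for your own model):

// Neo4j 4.x constraint syntax (5.x uses CREATE CONSTRAINT FOR (p:Person) REQUIRE p.name IS UNIQUE)
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;

// Roughly what a bulk merge keyed on ("Person", "name") boils down to
UNWIND $rows AS row
MERGE (p:Person {name: row.name})
SET p.age = row.age

Without the unique constraint, large or concurrent loads can produce the kind of weird results (duplicates, slow lookups) you describe; a create-then-merge sequence is not required.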

Create node with specific internal id using LOAD CSV

For the initial data import (step 1) into a Neo4j database I'm using the neo4j-admin import tool. There you can specify the internal id of a node with :ID in the header.
I would also like to use the LOAD CSV command for creating more nodes (step 2) in an already existing database (with the data from the previous step). I can't find an answer on how to specify the internal id of a node using this command.
Why is it not possible, while at the initial import it is? In the second step I have similar CSV files as in the first step, meaning I have a CSV file of nodes whose first column is the id of the node, AND I have a CSV file of the relationships between them with columns like start_id,end_id,relationshipType.
Thanks, Petr M
You have misunderstood how the IDs contained in the import input files are used. They are not used as native IDs.
The IDs in the input data files are only used to enable the import tool to know that node X in file A is supposed to be the same as node Y in file B. After the import is completed, the IDs from the files are forgotten.
Whenever a node is created, the neo4j server always decides on its own what the actual native ID will be.
Also, it is never recommended to store native IDs in the DB, since it is not a reliable way to identify a specific entity (node or relationship) over time. After an entity is deleted, its native ID can be re-assigned to a new entity.
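The practical consequence for step 2: carry the IDs from your files as an ordinary property and MATCH/MERGE on that instead of any native ID. A sketch, assuming node files with header id,name and relationship files with header start_id,end_id,relationshipType (the :Entity label and sourceId property are placeholders):

LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
MERGE (n:Entity {sourceId: row.id})
SET n.name = row.name;

LOAD CSV WITH HEADERS FROM 'file:///rels.csv' AS row
MATCH (a:Entity {sourceId: row.start_id})
MATCH (b:Entity {sourceId: row.end_id})
// a relationship type cannot be a variable in plain Cypher;
// apoc.merge.relationship(a, row.relationshipType, {}, {}, b) is the usual workaround
MERGE (a)-[:RELATED_TO]->(b);

A unique constraint or index on :Entity(sourceId) keeps the MATCHes fast.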

Neo4j APOC export/import with primary keys instead of internal ids

I am trying to import multiple CSVs, exported from multiple different Neo4j databases with APOC's export, into one big database. Some of the nodes are shared.
The problem is that the relationships in the CSV use Neo4j's internal IDs for _start and _end instead of the nodes' "primary key" (is @Index with primary = true, same as @Id, a feature of Neo4j itself or of Neo4j's Java OGM?). This is bad because these multiple exports could (and will) have the same internal IDs for different nodes, and the merged graph would be a mess. The same applies to nodes: I want to merge them based on the primary key during the import instead of creating duplicates.
Is there a way to export a Neo4j database with APOC so that it relies on primary keys instead of internal IDs? I need a CSV or JSON file, no CQL. Or is there another way of exporting a Neo4j database so that I can import multiple exports and they will merge seamlessly? ...something other than writing my own CSV exporter and importer; that would be the very last option.
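One workaround, assuming every node carries its own key property (here a hypothetical uuid): export the results of a query instead of the raw store, so the files contain your keys rather than internal IDs, and then import with LOAD CSV plus MERGE on those keys. A sketch with APOC's query export (requires apoc.export.file.enabled=true):

CALL apoc.export.csv.query(
  'MATCH (n) RETURN n.uuid AS uuid, labels(n) AS labels, n.name AS name',
  'nodes.csv', {});
CALL apoc.export.csv.query(
  'MATCH (a)-[r]->(b) RETURN a.uuid AS start_key, type(r) AS rel_type, b.uuid AS end_key',
  'rels.csv', {});

Importing with MERGE on uuid makes repeated imports from different databases converge on the shared nodes instead of duplicating them.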

Multiple Nodes per Line in Neo4j Batch Import Tool

Using Neo4j's Batch Import Tool, how can I create multiple nodes from a single row, and then attribute some properties to Node 1 and some to Node 2?
This is an example from 29.3:
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
Is there a way to make it so title is "movieId.title" and year is its own ID? Then I can abstract that out to multiple nodes.
The import tool (in contrast to LOAD CSV) expects exactly one node per line, so you have to do some preprocessing to make the format fit your desired graph model.
Typical candidates for this are csvkit or the usual suspects from the Unix command line: sed, awk, ...
In your case I'd strip out the title into a separate file for creating the :Title nodes, and create another csv file for the relationships between movies and titles.
You can re-use the same CSV file but use two different header files, with different columns used as :ID and the columns you don't want for this node marked as :IGNORE.
As the header is independent from the data you can use that approach to pull in the same file several times for different nodes, relationships, etc.
It's also explained here: http://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets
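A sketch of the two-header approach for the movie example (assuming movies.csv holds only the data rows, and that titles are unique enough to serve as IDs; all file names here are placeholders):

movie-header.csv (creates the movie nodes, skipping the title column):
movieId:ID,:IGNORE,year:int,:LABEL

title-header.csv (creates :Title nodes keyed on the title itself):
:IGNORE,title:ID(Title),:IGNORE,:IGNORE

rel-header.csv (connects each movie to its title, reading the same data file a third time):
:START_ID,:END_ID(Title),:IGNORE,:IGNORE

neo4j-admin import --nodes="movie-header.csv,movies.csv" --nodes=Title="title-header.csv,movies.csv" --relationships=HAS_TITLE="rel-header.csv,movies.csv"

The :ID(Title) group name keeps the title IDs in a separate ID space from the movie IDs; the exact option syntax varies between importer versions.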

How to import unique data with Talend?

I have 100M records in Oracle and am trying to import all of them into Neo4j with Talend. My question is: since the 100M records are updated every day, how can I make sure Talend will only import records which do not already exist in the Neo4j database? In other words, Talend should only import the updated records.
For example, suppose Neo4j currently contains 38890, 38891, 38892. In Oracle, the updated records are 38890, 38891, 38892, 38893. The expected result is that only 38893 will be imported.
The dataset is very large, so it seems not very efficient to import all of it into Neo4j every day and delete the duplicates. Could anyone help me out please? Thanks in advance.
You should do 2 loads: one initial FULL load, just like you do now, and another one for the daily incremental loads.
Check your primary keys and find a way to write a SELECT query which returns your new/modified rows. You also need a query which shows you which rows have been deleted/modified, as you need to remove those rows before adding the new/modified rows to your db.
To run this automatically, right-click on your job and select "Export Job". It will build your job into a Java JAR file, with a .sh and a .bat launcher. You can then use the Windows scheduler to execute it daily, or use cron if you happen to have a Linux server.
You certainly have an updated timestamp on your tables in Oracle, so I would use that to filter the data down to only what was updated since the last import, which would be much less data, e.g. 1-5M rows.
For those entries you can have a unique constraint and then use Cypher with MERGE on the entries, which is a get-or-create.
Make sure to use parameters for updating the data, whether against the embedded or the server APIs:
FOREACH (p IN {people} |
  MERGE (person:Person {name: p.name})
  ON CREATE SET person.age = p.age, ...
)
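For completeness, here is the unique constraint mentioned above, together with a modern equivalent of the same get-or-create (a sketch; the FOREACH/{param} form above matches older Neo4j versions, current ones use $-parameters):

// Neo4j 3.x/4.x constraint syntax (5.x: CREATE CONSTRAINT FOR (p:Person) REQUIRE p.name IS UNIQUE)
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;

// The same upsert with current parameter syntax
UNWIND $people AS p
MERGE (person:Person {name: p.name})
ON CREATE SET person.age = p.age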
