Do the apoc.import use merge or create to add new data? - neo4j

CALL apoc.import.csv(
[{fileName: 'file:/persons.csv', labels: ['Person']}],
[{fileName: 'file:/knows.csv', type: 'KNOWS'}],
{delimiter: '|', arrayDelimiter: ',', stringIds: false}
)
For this example, internally, does the 'import' use merge or create to add nodes, relationships and properties? I tested, it seems it uses 'create' to add new rows even for a new ID record. Is there a way to control this? When to use apoc.load VS apoc.import? It seems apoc.load is a lot more flexible, where users can choose to use cypher commands specifically for purposes. Right?

From the source of CsvEntityLoader (which seems to be doing the work under the covers), nodes are blindly created rather than being merged.
While there's an ignoreDuplicateNodes configuration property you can set, it just ignores IDs duplicated within the incoming CSV (i.e. it's not de-duplicating the incoming records against your existing graph). You could protect yourself from creating duplicate nodes by creating an appropriate unique constraint on any uniquely-identifying properties, which would at least prevent you accidentally running the same import twice.
Personally I'd only use apoc.import.csv to do a one-off bulk load of data into a fresh graph (or to load a dump from another graph that was exported as a CSV by something like apoc.export.csv.*). And even then, you've got the batch import tool that'll do that job with higher performance for large datasets.
I tend to use either the built-in LOAD CSV command or apoc.load.csv for most things, as you can control exactly what you do with each record coming in from the file (such as performing a MERGE rather than a CREATE).

As indicated by by #Pablissimo's answer, the ignoreDuplicateNodes config option (when explicitly set to true) does not actually check for duplicates in the DB - it just checks within the file. A request to address this hole was brought up before, but nothing has been done yet to address it. So, if this is a concern for your use case, then you should not use apoc.import.csv.
The rest of this answer applies iff your files never specify nodes that already exist in your DB.
If your node CSV file follows the neo4j-admin import command's import file header format and has a header that specifies the :ID field for the column containing the node's unique ID, then the apoc.import.csv procedure should, by default, fail when it encounters duplicate node IDs (within the same file). That is because the procedure's ignoreDuplicateNodes config value defaults to false (you can specify true to skip duplicate IDs instead of failing).
However, since your node imports are not failing but are generating duplicate nodes, that implies your node CSV file does not specify the :ID field as appropriate. To fix this, you need to add the :ID field and call the procedure with the config option ignoreDuplicateNodes:true. Or, you can modify those CSV files somehow to remove duplicate rows.

Related

Do not replace existing file when uploading file with same name Microsoft Graph API

How can I prevent replacing the existing file with a new file which has the same name, when I upload file to one drive?
I am using PUT /me/drive/items/{parent-id}:/{filename}:/content docs end point.
I instead need to keep indexing (test.jpg, test (1).jpg) or, just like google drive does, add two files with the same name.
You can control this behavior using Instance Attributes, specifically the #microsoft.graph.conflictBehavior query parameter. There are three supported conflict behaviors; fail, replace (the default), and rename.
The conflict resolution behavior for actions that create a new item. You can use the values fail, replace, or rename. The default for PUT is replace. An item will never be returned with this annotation. Write-only.
In order to have it automatically rename the file, you add #microsoft.graph.conflictBehavior=rename as a query parameter to your URI.
PUT /me/drive/items/{parent-id}:/{filename}:/content?#microsoft.graph.conflictBehavior=rename

Neo4j APOC import error

I have a data model that starts with a single record, this has a custom "recordId" that's a uuid, then it relates out to other nodes and they then in turn relate to each other. That starting node is what defines the data that "belongs" together, as in if we had separate databases inside neo4j. I need to export this data, into a backup data-set that can be re-imported into either the same or a new database with ease
After some help, I'm using APOC to do the export:
call apoc.export.cypher.query("MATCH (start:installations)
WHERE start.recordId = \"XXXXXXXX-XXX-XXX-XXXX-XXXXXXXXXXXXX\"
CALL apoc.path.subgraphAll(start, {}) YIELD nodes, relationships
RETURN nodes, relationships", "/var/lib/neo4j/data/test_export.cypher", {})
There are then 2 problems I'm having:
Problem 1 is the data that's exported has internal neo4j identifiers to generate the relationships. This is bad if we need to import into a new database and the UNIQUE IMPORT ID values already exist. I need to have this data generated with my own custom recordIds as the point of reference.
Problem 2 is that the import doesn't even work.
call apoc.cypher.runFile("/var/lib/neo4j/data/test_export.cypher") yield row, result
returns:
Failed to invoke procedure apoc.cypher.runFile: Caused by: java.lang.RuntimeException: Error accessing file /var/lib/neo4j/data/test_export.cypher
I'm hoping someone can help me figure out what may be going on, but I'm not sure what additional info is helpful. No one in the Neo4j slack channel has been able to help find a solution.
Thanks.
problem1:
The exported file does not contain any internal neo4j ids. It is not safe to use neo4j ids out of the database, since they are not globally unique. So you should not use them to transfer data from one database to another.
If you are about to use globally uniqe ids, you can use an external plugin like GraphAware UUID plugin. (disclaimer: I work for GraphAware)
problem2:
If you cannot access the file, then possible reasons:
apoc.import.file.enabled=true is not set in neo4j.conf
os level
permission is not set

Is there a way to avoid the target table in informatica (powercenter)?

In informatica mapping design, there must be a target table, but in my design, I only use informatica to call store procedures, and after they were called, all work has been done, so I don't need a target table to be inserted or updated.
I used a non-exist table as the target table, and one nonsense field as the input port(cause there must be at least one input port!), then unchecked or the option(insert, update,delete) in the session configuration, so that the informatica would not generated DML SQL statements, avoiding "no table" errors.
But then informatica treat the input row as reject row and try to write it into a bad file. And cause I unchecked the insert option, the session log showed that there was an error that it couldn't be insert into the bad file!
Strangely, this error never showed in the monitor, and all session run successfully! It only appeared in informatica's meta table.
Is there a better way to avoid this problem, although it has no effect to my result? Is there a possibility to use a non-exist table and do nothing to it (include reject the input rows)?
Use a filter transformation just before the target and put filter condition 'FALSE'
No rows will go to the target
I had run into this same issue when i wanted to just execute a stored procedure and nothing else.
I solved this by creating a dummy source object that had one port and a dummy target with one port of the same datatype. In the source qualifier I added a SQL statement select 1 from dual (since it's Oracle).
I then added a filter object that was set to false. Then I connected the single port from the source/qualifier through the filter and finally to the target.
When the mapping is run, the source qualifier will return 1 row of one value, this will pass through to the filter but nothing will come out of the filter because the filter is set to false. This mapping will always be successful and valid because all ports are connected a nothing makes it to the "dummy" target thus no bad file logs or failure, etc.
Let me know if you need any clarification and I can update this answer.
No, you always need a target for the mapping to be valid. But I would rather work with a flat file target instead of a database table, you'll have much less work to do.
If you're on Linux / Unix, you can even route the file to /dev/null (use folder:/dev/, file:null) so the file is not actually written to the filesystem.
And using one dummy port is the right way. As you have said, you need at least one port, even if you don't really use it.
As odd as this may sound (Unix systems): neither source, nor target need to exist.
Source (flat file): /dev/null, column DUMMY
Target (flat file): /dev/null, column DUMMY
And you don't need to use any databases for the session to succeed, nor use any filters. It runs.

Creating a DTS package that uses a stored procedure

We're trying to make a DTS package where it'll launch a stored procedure and capture the contents in a flat file. This will have to run every night, and the new file should overwrite the existing file.
This wouldn't normally be a problem, as we just plug in the query and it runs, but this time everything was complicated enough that we chose to approach it with a stored procedure employing temporary tables. How can I go about using this in a DTS package? I tried going the normal route with the Wizard and then plugging in EXEC BlahBlah.dbo... It did not care for that:
The Statement could not be parsed. Additional information: Invalid object name '#DestinyDistHS'. (Microsoft SQL Server Native Client 10.0)
Can anyone guide me in the right direction here?
Thanks.
Is it an option to simply populate a non-temp table in your SP, call it and select from the non temp table when exporting?
This is only an issue if you have multiple simultaneous calls to the stored procedure. In this case you can't save to a single table.
If you do have multiple simultaneous calls then you might be able to:
Create a temp table to hold results
Use INSERT INTO #TempTable EXEC YourProc
SELECT FROM #TempTable
You might need to do this in a more forgiving command line tool (like SQLCMD). It's not as fussy about metadata.

Storing a directory structure in database

In my rails app a user can have a directory structure which has folders and files in sub-folders. Which is the best way to store such data ??
Also, which database offers best way to do so?
You can store a directory tree in a single table using any SQL database, by making the table self-referential. A good example is the Windows Installer's Directory table, where you will see a structure like this:
Directory = primary key id field, typically an integer
Directory_Parent = "foreign key" id field, which points to the id of another Directory in the same table
Value = string containing the directory/folder name
Your file table would then have a foreign key referencing the Directory id. To find the full path, you must follow it up the chain and build up the path from the end (right), tacking each parent directory onto the front (left). For example, the file would point to Directory id '4' with the Value 'subfolder', then you fetch the parent's value 'folder', then the parents value again until you get to the root, creating a path like /root/folder/subfolder/filename.
If your database supports recursive queries (either Oracle's connect by or the standard recursive common table expressions) then a self referencing table is fine (it's easy to update and query).
If your DBMS does not support hierarchical queries, then Eimantas suggestion to use a preordered tree traversal scheme is probably the best way.
It's simple tree stored in sql. Either check the standard parent-child scheme or implement preordered tree traversal scheme (left-right).

Resources