We are currently evaluating Neo4j for some data analytics. Data from different databases will be pushed to a Neo4j database periodically, and these pushes can include additions, modifications, and deletions.
Since the way data is modeled in a graph database inherently differs from how it is stored in the origin SQL databases, we are trying to figure out how to handle the add, modify, and delete scenarios.
Are there any standard Rules/Ways?
Are there any tools for sync? (Other than CSV or similar imports)
Thanks in advance
There is no ready-to-use solution for your case; you will have to build it yourself.
#1 - Manual imports
You can prepare the data and manually execute an import (using the standard tools). Then, when new data appears, prepare it the same way and import it into the existing database.
The import tool and Cypher's LOAD CSV can be used here.
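For example, when the periodic feed arrives as CSV, a MERGE-based LOAD CSV statement keeps the import idempotent: new rows create nodes, re-sent rows only update properties. Below is a minimal sketch using the Neo4j Java driver (4.x); the Bolt URL, credentials, file name, label, and property names are all placeholders.

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Session;

    public class PeriodicCsvImport {
        public static void main(String[] args) {
            // Connection details are assumptions - adjust to your setup.
            try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                    AuthTokens.basic("neo4j", "password"));
                 Session session = driver.session()) {

                // MERGE makes the import idempotent: new rows create nodes,
                // rows matched by id only update properties.
                // The CSV file must sit in Neo4j's import directory.
                session.run(
                    "LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row " +
                    "MERGE (c:Customer {id: row.id}) " +
                    "SET c.name = row.name, c.updatedAt = timestamp()");
            }
        }
    }

MERGE covers additions and modifications; deletes are the awkward part - you typically need the source system to also export the deleted keys so that a follow-up MATCH ... DETACH DELETE can remove them from the graph.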
#2 - Unmanaged extension
You can develop an unmanaged extension that is capable of persisting data from your database into Neo4j. In this case some sort of sync protocol has to be implemented on both the client and the server side.
More information can be found here.
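A minimal sketch of such an extension against the Neo4j 3.x server API is shown below. The path, label, and parameters are invented for illustration; the class has to be packaged as a JAR, dropped into the plugins directory, and registered via dbms.unmanaged_extension_classes=<your.package>=/ext in neo4j.conf. A real extension would more likely accept a JSON body with many records per call.

    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.QueryParam;
    import javax.ws.rs.core.Context;
    import javax.ws.rs.core.Response;

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Label;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;

    // Hypothetical endpoint: POST /ext/sync/customer?id=42&name=Alice
    @Path("/sync")
    public class SyncResource {

        private final GraphDatabaseService db;

        public SyncResource(@Context GraphDatabaseService db) {
            this.db = db;
        }

        @POST
        @Path("/customer")
        public Response upsertCustomer(@QueryParam("id") String id,
                                       @QueryParam("name") String name) {
            try (Transaction tx = db.beginTx()) {
                // Upsert by business key so repeated syncs do not create duplicates.
                Node node = db.findNode(Label.label("Customer"), "id", id);
                if (node == null) {
                    node = db.createNode(Label.label("Customer"));
                    node.setProperty("id", id);
                }
                node.setProperty("name", name);
                tx.success();
            }
            return Response.noContent().build();
        }
    }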
#3 - neo4j-csv-firehose
There is an extension developed by @sarmbruster - neo4j-csv-firehose.
neo4j-csv-firehose enables Neo4j's LOAD CSV Cypher command to load from other datasources as well. It provides a Neo4j unmanaged extension doing on-the-fly conversion of the other datasource to csv - and can therefore act as input for LOAD CSV. Alternatively it can be run as a standalone server.
Check the README for more information.
#4 - neo4j-shell-tools
This is another project developed by @jexp - neo4j-shell-tools.
neo4j-shell-tools adds a number of commands to neo4j-shell which allow you to easily import and export data into a running Neo4j database.
Check the README for more information.
#5 - Liquigraph
Another interesting tool is Liquigraph.
Database migrations management tool, based on how Liquibase works.
You can write migrations for the Neo4j database in XML using this tool.
Also, you can check the other existing Neo4j tools - maybe something there works for you.
Not really sure what you're asking for.
Usually you have an import script that imports into the graph model.
This can be Cypher or Java code and be driven by CSV, JSON, or whatever your data source is (provided as a parameter); see the sketch below.
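For example, a small Java import script can read whatever the source format is, turn each record into a map, and hand the whole batch to a parameterized Cypher statement. A sketch assuming the Neo4j Java driver (4.x) and an invented Person model:

    import java.util.List;
    import java.util.Map;

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Session;

    import static org.neo4j.driver.Values.parameters;

    public class GraphImport {
        public static void main(String[] args) {
            // Rows would normally come from CSV, JSON, a JDBC query, etc.
            List<Map<String, Object>> rows = List.of(
                    Map.of("id", "1", "name", "Alice"),
                    Map.of("id", "2", "name", "Bob"));

            try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                    AuthTokens.basic("neo4j", "password"));
                 Session session = driver.session()) {

                // UNWIND lets one round trip create/update many nodes.
                session.run(
                    "UNWIND $rows AS row " +
                    "MERGE (p:Person {id: row.id}) " +
                    "SET p.name = row.name",
                    parameters("rows", rows));
            }
        }
    }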
Related
I just wanted to know if the Aqueduct ORM supports a simple in-memory database for testing purposes. I'm looking for something easy and lightweight to write the backend against before actually connecting it to Postgres.
I've used a similar approach with H2 and Postgres in Java, but it is rather error-prone: while the SQL interface may be similar, you could be using a feature that is available in one but not the other. Eventually either your development is blocked, or it seems fine but the real deployment hits issues.
I've found that starting a PostgreSQL instance in Docker is much easier than I first thought, and now I use the same principle for most external dependencies: run them inside Docker. If there is interest, I can open source a Dart package that starts the Docker container and waits until a certain string pattern is present on the output (e.g. a report of a successful start).
Aqueduct was built to be tested with a locally running instance of PostgreSQL. This avoids the class of errors that occur when using a different database engine in tests vs. deployment. It is a very important feature of Aqueduct.
The tl;dr is that you can use a local instance of PostgreSQL with the same efficiency as an in-memory database and there is documentation on the one-time setup process.
The Details
Aqueduct creates an intermediate representation of your data model at startup by reflecting on your application code. This representation drives database migrations, serialization, runtime reflection, and can even be exported as JSON to create data modeling tools on top of Aqueduct.
At the beginning of each test suite, your test harness uses this representation to generate temporary tables in a local database named dart_test. Temporary tables are destroyed as soon as the database connection is lost, which you can configure to happen between tests, groups of tests, or entire test suites depending on your needs. It turns out that this is very fast - on the order of milliseconds.
CI platforms like TravisCI and Appveyor both support local PostgreSQL processes. See this script and this travis config for an example.
So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running streaming into BigQuery, but I am going to want to stream it into a full relational DB (i.e. Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here because the information I find on how to do this all refers to writing to Cloud SQL in batches and not full-out streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of doing an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
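For reference, a JdbcIO write in a streaming pipeline looks roughly like the sketch below (assuming the beam-sdks-java-io-jdbc module; the JDBC URL, table, columns, and element type are placeholders for a MySQL-flavoured Cloud SQL instance):

    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class WriteToCloudSql {
        // events can be bounded or unbounded; JdbcIO does not care.
        static void write(PCollection<KV<String, Long>> events) {
            events.apply("WriteToCloudSQL", JdbcIO.<KV<String, Long>>write()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                        "com.mysql.cj.jdbc.Driver",
                        "jdbc:mysql://<CLOUD_SQL_IP>:3306/mydb")
                    .withUsername("user")
                    .withPassword("password"))
                .withStatement("INSERT INTO events (name, count) VALUES (?, ?)")
                // Optional: this is the "batching" referred to above - how many
                // records go into one database call, not batch vs. streaming.
                .withBatchSize(500)
                .withPreparedStatementSetter((element, statement) -> {
                    statement.setString(1, element.getKey());
                    statement.setLong(2, element.getValue());
                }));
        }
    }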
I am looking for a way to store Cypher queries and, when adding nodes and relationships, be notified when the new data matches one of the stored queries. Can this be done currently? Something similar to Elasticsearch percolators would be great.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html
Update
The answer below was accurate in 2014. It's mostly accurate in 2018.
But now there is a way of implementing triggers in Neo4j provided by Max DeMarzi which is pretty good, and will get the job done.
Original answer below.
No, it can't.
You might be able to get something similar to what you want by using a TransactionEventHandler object, which basically lets you bind a piece of code (in Java) to the processing of a transaction.
I'd be really careful with running Cypher in this context though. Depending on what kind of matching you want to do, you could really slaughter performance by running it each time new data is created in the graph. Usually triggers in an RDBMS are specific to inserts or updates on a particular table. In Neo4j, the closest equivalent you might have is acting on the creation/modification of a node with a certain label. If your app has any number of different node classes, it wouldn't make sense to run your trigger code whenever new relationships or nodes are created, because most of the time the node type probably wouldn't be relevant to the trigger code. A rough sketch of such a handler follows.
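The sketch below targets the Neo4j 3.x embedded API; the Article label and the notification step are placeholders for whatever matching logic you store.

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Label;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.event.TransactionData;
    import org.neo4j.graphdb.event.TransactionEventHandler;

    public class PercolatorHandler extends TransactionEventHandler.Adapter<Void> {

        @Override
        public Void beforeCommit(TransactionData data) {
            // Only look at newly created nodes, and only at those with a
            // relevant label, to keep the per-transaction overhead small.
            for (Node node : data.createdNodes()) {
                if (node.hasLabel(Label.label("Article"))) {
                    // run your stored-query matching / notification here
                }
            }
            return null;
        }

        public static void register(GraphDatabaseService db) {
            db.registerTransactionEventHandler(new PercolatorHandler());
        }
    }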
Related reading: Do graph databases support triggers? and a feature request for triggers in neo4j
Neo4j 3.5 supports triggers.
To use this functionality, enable apoc.trigger.enabled=true in $NEO4J_HOME/conf/neo4j.conf.
You also have to add APOC to the server - it's not there by default.
In a trigger you register Cypher statements that are called when data in Neo4j is changed (created, updated, deleted). You can run them before or after commit.
Here is the help doc -
https://neo4j-contrib.github.io/neo4j-apoc-procedures/#triggers
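For illustration, a trigger is registered with a single apoc.trigger.add call, passing a name, the Cypher statement to run, and a selector such as {phase:'after'}. The sketch below does it through the Java driver; the trigger name and the property it sets are made up.

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Session;

    public class RegisterTrigger {
        public static void main(String[] args) {
            try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                    AuthTokens.basic("neo4j", "password"));
                 Session session = driver.session()) {

                // Registers a trigger named 'setCreated' that stamps every node
                // created in a transaction with a timestamp, after commit.
                session.run(
                    "CALL apoc.trigger.add('setCreated', " +
                    "\"UNWIND $createdNodes AS n SET n.createdAt = timestamp()\", " +
                    "{phase:'after'})");
            }
        }
    }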
I have a Rails app with a table of about 30 million rows that I build from a text document my data provider gives me quarterly. From there I do some manipulation and comparison with some other tables and create an additional table with more customized data.
My first time doing this, I ran a ruby script through Rails console. This was slow and obviously not the best way.
What is the best way to streamline this process and update it on my production server without any, or at least very limited downtime?
This is the process I'm thinking is best for now:
Create rake tasks for reading in the data. Use the activerecord-import plugin to do batch writing and to turn off ActiveRecord validations. Load this data into brand new, duplicate tables.
Build indexes on newly created tables.
Rename newly created tables to the names the rails app is looking for.
Delete the old.
All of this I'm planning on doing right on the production server.
Is there a better way to do this?
Other notes from comments:
Tables already exist
Old tables and data are disposable
Tables can be locked for select only
Must minimize downtime
Our current server situation is 2 High CPU Amazon EC2 instances. I believe they have 1.7GB of RAM so storing the entire import temporarily is probably not an option.
New data is raw text file, line delimited. I have the script for parsing it already written in Ruby.
1) create "my_table_new" as an empty clone of "my_table"
2) import the file (in batches of x lines) into my_table_new - indexes built as you go.
3) Run: RENAME TABLE my_table TO my_table_old, my_table_new TO my_table;
Doing this as one command makes it (close enough to) instant, so there is virtually no downtime. I've done this with large data sets, and since the rename is the 'switch', you should retain uptime.
Depending on your logic, I would seriously consider processing the data in the database using SQL. This is close to the data and 30m rows is typically not a thing you want to be pulling out of the database and comparing to other data you have also pulled out of the database.
So think outside of the Ruby on Rails box.
SQL has built-in capabilities to join data, compare data, and insert and update tables. Those capabilities can be very powerful and fast, and they let the processing happen close to the data.
I'm using Java DB (Derby).
I want to import a public view of my data to another database (also in java db).
I want to take this data and save it into the other database. I'm having trouble since the general rule is one connection per database.
Help would be much appreciated.
You need two connections, one to each database.
If you want the two operations to be a single unit of work, you should use XA JDBC drivers so you can do two-phase commit. You'll also need a JTA transaction manager.
This is easy to do with Spring.
SELECT from one connection; INSERT into the other. Just standard JDBC is what I'm thinking. You'll want to batch your INSERTs and checkpoint them if you have a lot of rows so you don't build up a huge rollback segment.
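A plain-JDBC sketch of that pattern with two embedded Derby connections (the database names, view, columns, and batch size are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class CopyPublicView {
        public static void main(String[] args) throws SQLException {
            try (Connection source = DriverManager.getConnection("jdbc:derby:sourceDb");
                 Connection target = DriverManager.getConnection("jdbc:derby:targetDb")) {

                target.setAutoCommit(false); // commit in chunks, not per row

                try (Statement select = source.createStatement();
                     ResultSet rs = select.executeQuery("SELECT id, name FROM public_view");
                     PreparedStatement insert = target.prepareStatement(
                             "INSERT INTO copied_data (id, name) VALUES (?, ?)")) {

                    int pending = 0;
                    while (rs.next()) {
                        insert.setInt(1, rs.getInt("id"));
                        insert.setString(2, rs.getString("name"));
                        insert.addBatch();

                        // Flush and commit every 1000 rows to keep the
                        // rollback segment small.
                        if (++pending % 1000 == 0) {
                            insert.executeBatch();
                            target.commit();
                        }
                    }
                    insert.executeBatch();
                    target.commit();
                }
            }
        }
    }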
I'd wonder why you have to duplicate data this way. "Don't Repeat Yourself" would be a good argument against it. Why do you think you need it in two places like this?