Persisting Neo4j transactions to external storage - neo4j

I'm currently working on a new Java application which uses an embedded Neo4j database as its data store. Eventually we'll be deploying to a cloud host which has no persistent data storage available - we're fine while the app is running but as soon as it stops we lose access to anything written to disk.
Therefore I'm trying to come up with a means of persisting data across an application restart. We have the option of capturing any change commands as they come into our application and writing them off somewhere, but that means retaining a lifetime of changes and applying them in order as an application node comes back up. Is there any functionality in Neo4j or SDN that we could leverage to capture changes at the Neo4j level and write them off to an AWS S3 store or the like? I have had a look at Neo4j clustering, but I don't think that will work either, both at a technical level (limited protocol support on our cloud platform) and because of the cost of an Enterprise licence.
Any assistance would be gratefully accepted...

If you are using embedded Neo4j, you should already know where in your code you perform create/update/delete queries against Neo4j, no?
To answer your question, Neo4j has a TransactionEventHandler (https://neo4j.com/docs/java-reference/current/javadocs/org/neo4j/graphdb/event/TransactionEventHandler.html) that captures every transaction and tells you which nodes and relationships have been added, updated, or deleted.
In fact, it's the way to implement triggers in Neo4j.
But in your case I would consider:
using another cloud provider that gives you persistent storage
if that is not possible, implementing a hook on application shutdown that copies the graph.db folder to external storage (and doing the opposite at startup)
using Neo4j as a remote server, installed on a cloud provider with persistent storage.
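The second option (copy the graph.db folder out at shutdown and back in at startup) could look roughly like the sketch below. It uses only the JDK; the class name, method names, and both paths are made up for illustration, and the backup location is assumed to be reachable as an ordinary file path. Pushing to an object store such as S3 would replace copyTree with calls to that provider's SDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

/** Copies a store directory to and from a backup location (all names here are hypothetical). */
final class GraphStoreBackup {

    /** Recursively copies every file and sub-directory from source into target. */
    static void copyTree(Path source, Path target) throws IOException {
        try (Stream<Path> paths = Files.walk(source)) {
            paths.forEach(from -> {
                Path to = target.resolve(source.relativize(from).toString());
                try {
                    if (Files.isDirectory(from)) {
                        Files.createDirectories(to);
                    } else {
                        Files.createDirectories(to.getParent());
                        Files.copy(from, to, StandardCopyOption.REPLACE_EXISTING);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    /** Call at startup, before the embedded database is opened. */
    static void restoreIfPresent(Path backup, Path store) throws IOException {
        if (Files.isDirectory(backup)) {
            copyTree(backup, store);
        }
    }

    /** Registers a JVM shutdown hook that copies the store to the backup location. */
    static Thread registerBackupOnShutdown(Path store, Path backup) {
        Thread hook = new Thread(() -> {
            try {
                // Assumption: the embedded database has already been shut down cleanly here.
                copyTree(store, backup);
            } catch (IOException | UncheckedIOException e) {
                System.err.println("Backup failed: " + e);
            }
        });
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

One thing to watch: JVM shutdown hooks run in no guaranteed order, so if the database itself is also stopped from a shutdown hook, it is safer to stop it and copy the folder inside the same hook. A hard kill of the process skips the hook entirely, so anything written since the last copy would be lost.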

Related

Causal-cluster-friendly implementation

I am building an application that uses the native neo4j JavaScript driver. I want to make sure that my code will work if we migrate to a causal cluster.
The online documentation doesn't seem to be clear about how to do this: I notice sparse references to things like "bookmarks" and "reading what you have written", etc. But how it all fits together is unclear.
Can someone please provide a synopsis?
To use a causal cluster you will need to change the following:
1) The connection URL: replace bolt://localhost:7687 with bolt+routing://localhost:7687
This lets the driver load-balance queries across the cluster and be fault tolerant without you doing anything else.
2) When you open a new session, you should specify what you will do in that session, i.e. READ or WRITE.
This helps the driver choose the right server (i.e. a core or a replica server). Otherwise it assumes you will do some WRITE operations, and the driver will always choose a core server.
3) Because you will be in a clustered environment, there is some lag (a few seconds) in the propagation of an update inside the cluster.
Sometimes you also need to read your own writes across two sessions. That is where you will need the bookmark functionality.
Documentation is here : https://neo4j.com/docs/developer-manual/current/drivers/
Cheers.

does neo4j embedded driver lock the db files?

I have a general question about the embedded driver for Neo4j. What exactly does it mean to be embedded, besides being lower level and higher performance? Is it an actual instance of the database service, or just a driver for connecting to an existing database process or service? For instance:
Does using the embedded driver libraries acquire an exclusive lock on the database files?
Can multiple clients use the embedded driver to use the same database at the same time?
Can it run against a database that already has a database service (along with the REST API) running? Initial tests seem to indicate no, since it throws a file lock exception.
Does the embedded driver have to be on the same machine or in the same process as the database service? For instance, suppose the DB data files are on a shared SAN that multiple machines can access, and another server is running the REST API and the Neo4j service. The configuration on the driver seems to point to the data files directly rather than to a service or port.
I am using embedded Neo4j in a project.
Embedded Neo4j is a Neo4j server started and shut down by your application. So it is not just a driver used to connect to some standalone server. For a standalone server you would use Neo4j over REST (locally or remotely).
Because of its implementation, embedded Neo4j can be used by only one application: the application that started the embedded instance. It takes a lock on the graph files, and you can't use any other application (e.g. neo4j-shell) to access those files as long as the embedded server is running.
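To make the locking behaviour concrete, here is a small JDK-only illustration of what an exclusive lock on a file in the store directory does. This is not Neo4j's own code; the class, the method, and the lock file name are invented for the example.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrates an exclusive lock on a marker file inside a store directory. */
final class StoreLockDemo {

    /**
     * Tries to take an exclusive lock on the given lock file.
     * Returns the lock, or null if another holder already has it.
     */
    static FileLock tryLockStore(Path lockFile) throws IOException {
        FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        try {
            FileLock lock = channel.tryLock(); // null when another process holds the lock
            if (lock == null) {
                channel.close();
            }
            return lock;
        } catch (OverlappingFileLockException e) {
            // Thrown when this same JVM already holds a lock on the file.
            channel.close();
            return null;
        }
    }
}
```

The first caller gets a lock object and keeps it until it releases the lock or closes the channel; any later caller gets null, which is the situation the "file lock exception" in the question corresponds to. Note that the JDK documents that on some platforms closing any channel on a file can release the locks the JVM holds on it, so a real implementation would normally keep a single channel per lock file.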

offline web application design recommendation

I want to know which is the best architecture to adopt for this case :
I have many shops that connect to a web application developed using Ruby on Rails.
The internet is not reachable all the time.
The solution was to develop an offline system which requires installing a local copy of the distant database.
All of this has already been developed.
Now what I want to do :
Work always on the local copy of the database.
Any change on the local database should be synchronized with distant database.
All the local copies should hold the same data as one another.
To resolve this problem I thought about using JMS-like software, possibly RabbitMQ.
This consists of pushing every SQL request into a JMS queue, to be executed on the distant instance of the application, which will insert into the distant DB and push the insert or SQL statement into another queue that is read by all the local instances. This seems complicated and is likely to slow down the application.
Is there a design or recommendation that I should apply to resolve this kind of problem?
You can do that, but essentially you are developing your own replication engine. Those things can be a bit tricky to get right (what happens if m1 and m3 are executed on replica r1, but m2 isn't?). I wouldn't want to develop something like that unless you are sure you have the resources to make it work.
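To show the kind of bookkeeping the m1/m2/m3 case forces on you, here is a rough sketch of a replica that only applies numbered messages in order and holds back anything that arrives after a gap. The class and the numbering scheme are assumptions made for this example; nothing here comes from RabbitMQ or JMS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Applies numbered change messages strictly in sequence, holding back anything after a gap. */
final class OrderedApplier {
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private final List<String> applied = new ArrayList<>();
    private long nextExpected = 1;

    /** Accepts message number seq; applies it and any buffered successors once there is no gap. */
    void receive(long seq, String statement) {
        if (seq < nextExpected) {
            return; // duplicate of something already applied
        }
        pending.put(seq, statement);
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            // A real replica would execute the statement here; this sketch only records it.
            applied.add(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
    }

    /** Statements applied so far, in the order they were applied. */
    List<String> applied() {
        return applied;
    }

    /** Number of messages still waiting for an earlier message to arrive. */
    int buffered() {
        return pending.size();
    }
}
```

If m1 and m3 arrive, only m1 is applied and m3 waits in the buffer until m2 shows up. Even this leaves open what to do when m2 never arrives, which is part of why writing your own replication is harder than it looks.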
I would look into an existing off-the-shelf replication solution. If you are already using a SQL DB, it probably has some support for it. Look here for more details if you are using MySQL.
Alternatively, if you are willing to explore other backends, I heard that CouchDB has great support for replication. I also heard of people using git libraries to do that sort of thing.
Update: After your comment, I realize you already use MySQL replication and are looking for a solution for re-syncing the databases after being offline.
Even in that case RabbitMQ doesn't help you at all, since it requires a constant connection to work, so you are back to square one. The easiest solution would be to just write all the changes (SQL commands) into a text file at the remote location; then, when you get the connection back, copy that file (scp, ftp, email or whatever) to the master server, run all the commands there, and then just resync all the replicas.
Depending on your specific project, you may also need to make sure there are no conflicts when running commands from different remote locations, but there is no general technical solution to this. Again, depending on the project, you may want to cancel one of the transactions, notify the users that it happened, and so on.
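The text-file approach above could be sketched like this. It is only an outline using the JDK: the class name and the one-command-per-line format are assumptions, the executor stands in for whatever actually runs the SQL on the master, and conflict handling is left out entirely.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Append-only journal of change commands, one command per line. */
final class ChangeJournal {
    private final Path file;

    ChangeJournal(Path file) {
        this.file = file;
    }

    /** Appends one command while offline. Assumes a command never contains a line break. */
    void append(String command) throws IOException {
        Files.write(file, Collections.singletonList(command), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Replays every recorded command in its original order, then deletes the journal file. */
    int replay(Consumer<String> executor) throws IOException {
        if (!Files.exists(file)) {
            return 0;
        }
        List<String> commands = Files.readAllLines(file, StandardCharsets.UTF_8);
        commands.forEach(executor);
        Files.delete(file);
        return commands.size();
    }
}
```

A shop would call append for every change while offline; once the file has been copied to the master, replay is run there with an executor that sends each line to the database. If the executor throws partway through, the file is left in place, so a real version would also need to remember how far it got.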
I would recommend taking a look at CouchDB. It's a non-SQL database that does exactly what you are describing automatically. It's used especially in phone applications that often don't have internet or data connectivity. The idea is that you have a local copy of a CouchDB database and one or more remote CouchDB databases. The CouchDB server then takes care of the replication between the distributed systems, and you always work off your local database. This approach is nice because you don't have to build your own distributed replication engine. For more details I would take a look at the 'Distributed Updates and Replication' section of their documentation.

Have additional connections to Derby (read-only)

What I want to do: My application has a full connection to a Derby DB, and I want to poke around in the DB (read-only) in parallel (using a different tool).
I'm not sure how Derby actually works internally, but I understand that I can have only 1 active connection to a Derby DB.
However, since the DB consists only of files on my HDD, shouldn't I be able to open additional connections to it in read-only mode?
Are there any tools to do just that?
There are two ways to run Apache Derby.
Embedded: You run the DB within your application → only that one application (JVM) can open the database
Client: You start the DB as a server in a separate process → classic DB with many connections
You can recognize the type by the driver size: if the driver is larger than 2 MB, you are using the embedded version.
Update
When you start up the Derby engine (server or embedded), it gets exclusive access to the database files.
If you need to access a single database from more than one Java Virtual Machine (JVM), you will need to put a server solution in place. You can allow applications from multiple JVMs that need to access that database to connect to the server.
For details see Double-booting system behavior.
I realize this is an old question, but I thought I might add a little more detail on a solution since links in the currently accepted answer are broken.
It is possible to run the Derby Network Server within a JVM that is already using the embedded database. The code that is using the embedded Derby database doesn't need to change anything and can keep using the DB as is, but with the Derby Network Server started, other programs can connect to Derby and access the database.
All you need to do is ensure that derbynet.jar is on the classpath.
Then you can do one of the following:
Include the following line in the derby.properties file: derby.drda.startNetworkServer=true
Specify the property as a system property at Java startup:
java -Dderby.drda.startNetworkServer=true
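You can use the NetworkServerControl API to start the Network Server from a separate thread within a Java application:
// needs: import java.io.PrintWriter; import org.apache.derby.drda.NetworkServerControl;
NetworkServerControl server = new NetworkServerControl(); // default host and port
server.start(new PrintWriter(System.out, true)); // console messages go to stdout; both calls can throw Exception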
More details here: http://db.apache.org/derby/docs/10.9/adminguide/tadminconfig814963.html
Keep in mind that doing this does not enable any security on this connection, so it is not a good idea to do this on a production system. It is possible to add security though and that is documented here: http://db.apache.org/derby/docs/10.9/adminguide/cadminnetservsecurity.html
Two other ideas:
In your application, shut down the database and close the connection when the database is not actively in use. Then your application won't interfere with another tool which is trying to open the database.
Make a copy of your database, by taking a backup (you can do this while the database is open by your application), then restore that backup to a separate place on your disk. Then you can use another tool to access the copied database at your ease.
If you can afford the memory and do not need up-to-date data, then you can access read-only databases from multiple JVMs by creating in-memory copies:
ij> connect 'jdbc:derby:memory:memdb;restoreFrom=mydb';

A way to synchronize data between an external device and a database?

In the application I am designing, I have to communicate with a device and store a history of data readings in a database. The device is essentially a sensor that spits out numbers via the serial port. The user end of the application is a Ruby on Rails interface that allows the user to view this data and configure the device.
I am wondering what kind of connection between the database and the device you could recommend for this kind of a setup.
Up to this point, I had a custom application running on a host computer (a computer with the device connected directly through a serial port) that would serve as a bridge to a MySQL database. The application would connect directly to the MySQL database and execute queries. It works fairly well, but I am not sure if this is the best solution.
The only other alternative I see is to have an intermediate application that my custom application could connect to, instead of directly going to the database. This could be a part of the main application, or something separate. Would this be a better solution?
Would you recommend another approach?
Thank you,
I have a similar structure, although I fetch my data from a Web Service. The way I organize it is:
Create classes in lib/imports, e.g. DailyDataImport, DailyDataSummarize (you can organize the hierarchy and names as you wish).
Create a rake task under a new namespace, say import, and add it to your cron job at whatever frequency you need. Take a look at Cron in Ruby. It's helpful.
This allows me to have a better control over what goes in my database.
Some questions to consider:
What schedule does the device follow to populate the data?
Do you need the data as-is, do you want a little control over it, or do you need to process it (summarizing, aggregating, etc.)?
MS SQL Server 2008 has great data synchronisation support.
SQL Server 2008 Express is free and can act as a replication subscriber (but not publisher) for clients.
Microsoft Sync Framework
