How to dump contents of membase db to seed another cluster - membase

I am using membase 1.7.1 server cluster of 3 machines (vbuckets only), and would like to be able to back up the contents for the -- presumably unlikely -- case that the entire cluster goes down.
I periodically get new data from my provider; I want to keep the old data around more or less indefinitely, and add the new data. Imagine a wine rating application. New vintages come in all the time, but I need to keep the old ones around.
Currently I have a process which does the following:
1. Download some data from 3rd party provider
2. Push data into my vbucket; some old data may be overwritten, some data will be new
3. Hang out until next data update; other processes will be reading the data
What I'd like to do is:
1. See if my bucket has any data in it
2. If it doesn't, load from offline storage (see step #5)
3. Download some data from 3rd party provider
4. Push data into my vbucket; some old data may be overwritten, some data will be new
5. Take a dump of all data into offline storage
6. Hang out until next data update; other processes will be reading the data
Steps 1, 2, and 5 are new.
So the question is about step #5: is the TAP protocol a good way to dump out the contents of my membase bucket? Will it interfere with readers?

The membase documentation recommends the mbbackup tool for backups; it is invoked manually from the command line, outside of your application. The dumped data can be restored via mbrestore, and the target of mbrestore can be a different cluster from the one you ran mbbackup on.
Reference: http://www.couchbase.org/wiki/display/membase/Membase+Server+version+1.7.1+and+up
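For illustration, step #5 can be a small scheduled wrapper around mbbackup. The data path and the positional arguments below are assumptions based on the 1.7 documentation; check mbbackup's own help output on your nodes before relying on this.

    # Sketch: periodic offline dump of this node's bucket data via mbbackup.
    # Assumptions to verify on your install: mbbackup takes the bucket's data
    # directory and a destination directory as positional arguments, and the
    # data lives under /var/opt/membase/<version>/data/ns_1 on each node.
    import datetime
    import pathlib
    import subprocess

    DATA_DIR = "/var/opt/membase/1.7.1/data/ns_1"    # assumed bucket data location
    BACKUP_ROOT = pathlib.Path("/backups/membase")   # offline storage mount point

    def dump_bucket():
        dest = BACKUP_ROOT / datetime.datetime.now().strftime("backup-%Y%m%d-%H%M%S")
        dest.mkdir(parents=True, exist_ok=True)
        # mbbackup copies the data files while the node keeps serving readers.
        subprocess.run(["mbbackup", DATA_DIR, str(dest)], check=True)
        return dest

    if __name__ == "__main__":
        print("backup written to", dump_bucket())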
If you're on AWS, you can run membase on EBS volumes and have the option of periodically snapshotting those volumes to Amazon S3.
Reference: http://couchbase.org/forums/thread/correct-way-back-aws-membase-ebs

Related

How to separate application and data syncing implementations in Kubernetes?

I want to build an application/service that uses static application data which can get updated over time. Currently, I implement this by having both the application and the data in the same container. However, this requires a redeployment whenever either the application or the data changes. I want to separate the application and the data volume implementations so that I can update each independently, meaning I won't have to rebuild the application layer when the application data is updated, and vice versa.
Here are the characteristics of the Application Data and its Usage:
The data is not frequently updated, but it is read very frequently by the application
The data is not a database; it's a collection of file objects ranging in size from 100 MB to 4 GB, initially stored in cloud storage
The data in cloud storage serves as the single source of truth for the application data
The application will only read the data; the process of updating the data in cloud storage is external and outside the application's scope
So here, we are interested in syncing the data in cloud storage to the volume in the Kubernetes deployment. What's the best way to achieve this in Kubernetes?
I have several options in mind:
1. Using one app container in one deployment, in which the app also includes the logic for data loading and updates, pulling data from cloud storage into the container --> simple, but tightly coupled with the storage read-write implementation
2. Using the cloud storage directly from the app --> this doesn't require a container volume, but I was concerned about the huge file sizes, because the app is an interactive service which requires a quick response
3. Using two containers in one deployment sharing the same volume --> allows great flexibility for the storage read-write implementation
   - one container for the application service, reading from the shared volume
   - one container for updating data and listening for data-update requests, writing data to the shared volume --> this process pulls data from cloud storage into the shared volume
4. Using one container with a Persistent Disk
   - an external process which writes to the persistent disk (not sure how to do this yet with cloud storage/file objects; need to find a way to sync GCS to a persistent disk)
   - one container for the application service, which reads from the mounted volume
5. Using FUSE mounts
   - an external process which writes to cloud storage
   - a container which uses FUSE mounts
I am currently leaning towards option 3, but I am not sure if it's the common practice of achieving my objective. Please let me know if you have better solutions.
Yes, option 3 is the most common one, but make sure you use an initContainer to copy the data from your cloud storage to a local volume. That local volume could be any of the volume types supported by Kubernetes.
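For what it's worth, here is a minimal sketch of that suggestion using the official Kubernetes Python client. The images, bucket name (gs://my-app-data), labels, and paths are placeholders, and it assumes the data fits in an emptyDir volume and is pulled from GCS with gsutil when the pod starts.

    # Sketch: a Deployment whose initContainer pulls the application data from
    # cloud storage into an emptyDir volume that the app container then reads.
    # "gs://my-app-data" and both images are placeholders, not real resources.
    from kubernetes import client, config

    def build_deployment():
        data_volume = client.V1Volume(name="app-data",
                                      empty_dir=client.V1EmptyDirVolumeSource())
        loader = client.V1Container(
            name="data-loader",
            image="google/cloud-sdk:slim",                 # provides gsutil
            command=["gsutil", "-m", "rsync", "-r",
                     "gs://my-app-data", "/data"],          # hypothetical bucket
            volume_mounts=[client.V1VolumeMount(name="app-data", mount_path="/data")])
        app = client.V1Container(
            name="app",
            image="my-registry/my-app:latest",             # your application image
            volume_mounts=[client.V1VolumeMount(name="app-data",
                                                mount_path="/data",
                                                read_only=True)])
        template = client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "my-app"}),
            spec=client.V1PodSpec(init_containers=[loader],
                                  containers=[app],
                                  volumes=[data_volume]))
        return client.V1Deployment(
            api_version="apps/v1", kind="Deployment",
            metadata=client.V1ObjectMeta(name="my-app"),
            spec=client.V1DeploymentSpec(
                replicas=1,
                selector=client.V1LabelSelector(match_labels={"app": "my-app"}),
                template=template))

    if __name__ == "__main__":
        config.load_kube_config()   # or config.load_incluster_config()
        client.AppsV1Api().create_namespaced_deployment(namespace="default",
                                                        body=build_deployment())

Note that with an initContainer the data is refreshed only when a pod starts; if the volume must be updated while the pod is running, the sidecar variant of option 3 is the way to go.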

Persisting Neo4j transactions to external storage

I'm currently working on a new Java application which uses an embedded Neo4j database as its data store. Eventually we'll be deploying to a cloud host which has no persistent data storage available - we're fine while the app is running, but as soon as it stops we lose access to anything written to disk.
Therefore I'm trying to come up with a means of persisting data across an application restart. We have the option of capturing any change commands as they come into our application and writing them off somewhere, but that means retaining a lifetime of changes and applying them in order as an application node comes back up. Is there any functionality in Neo4j or SDN that we could leverage to capture changes at the Neo4j level and write them off to an AWS S3 store or the like? I have had a look at Neo4j clustering, but I don't think that will work, either at a technical level (limited protocol support on our cloud platform) or because of the cost of an Enterprise licence.
Any assistance would be gratefully accepted...
If you have an embedded Neo4j, you should know where in your code you are performing update/create/delete queries in Neo4j, no?
To answer your question: Neo4j has a TransactionEventHandler (https://neo4j.com/docs/java-reference/current/javadocs/org/neo4j/graphdb/event/TransactionEventHandler.html) that captures every transaction and tells you which nodes/relationships have been added, updated, or deleted.
In fact, it's the way triggers are implemented in Neo4j.
But in your case I would consider the following:
use another cloud provider that allows you to have persistent storage
if that's not possible, implement a hook on application shutdown that copies the graph.db folder to external storage, and do the opposite on startup (a rough sketch follows this list)
use Neo4j as a remote server, installed on a cloud provider that offers persistent storage.
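For the copy-on-shutdown idea, here is a rough sketch of the mechanics as an external wrapper script, written in Python with boto3 and placeholder bucket/paths; inside the Java application itself the equivalent would be a JVM shutdown hook using the AWS SDK for Java.

    # Sketch: copy the embedded store directory to S3 on shutdown and restore it
    # on startup. Bucket name and store path are placeholders; assumes boto3 and
    # AWS credentials are configured, and that the database has been shut down
    # cleanly before the upload (otherwise the store files may be inconsistent).
    import pathlib
    import boto3

    BUCKET = "my-neo4j-backups"              # hypothetical bucket
    STORE_DIR = pathlib.Path("graph.db")     # embedded Neo4j store directory

    s3 = boto3.client("s3")

    def upload_store():
        for path in STORE_DIR.rglob("*"):
            if path.is_file():
                s3.upload_file(str(path), BUCKET, path.as_posix())

    def download_store():
        # Pagination omitted for brevity; list_objects_v2 returns at most 1000 keys.
        for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
            target = pathlib.Path(obj["Key"])
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], str(target))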

How does Redis persist data on my local Apache server even after reboot and complete power down?

From what I understand, Redis stores data in memory, which I take to mean my RAM when I am running a local Apache development server. I tried powering down my computer and disconnecting the power cable as well, but the Redis data for my local development website persisted when I powered my computer back up and tested my test website again. I thought RAM gets completely wiped on a system reboot, so how does Redis persist data across reboots in my local development environment? Thanks! :)
Redis serves data only out of RAM, but it provides two modes of persistence RDB (snapshot persistence) and AOF (changelog persistence). If either mode of persistence is enabled on your Redis server, then your data will persist between reboots.
The config directives you want to check are:
appendonly yes
save
More information on Redis Persistence here.
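If you want to check this programmatically, the same information is available through CONFIG GET; here is a small sketch with the redis-py client (host/port are whatever your local setup uses):

    # Sketch: inspect (and optionally enable) Redis persistence from Python.
    # Assumes redis-py is installed and Redis is listening on localhost:6379.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    print(r.config_get("appendonly"))   # {'appendonly': 'yes'} means AOF is on
    print(r.config_get("save"))         # e.g. {'save': '900 1 300 10'} -> RDB snapshots
    print(r.config_get("dir"))          # directory holding dump.rdb / the AOF file

    # Enable AOF at runtime (also set it in redis.conf so it survives restarts):
    r.config_set("appendonly", "yes")

    # Trigger an RDB snapshot in the background right now:
    r.bgsave()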
Redis has persistence options that saves Redis data in either RDB or AOF format (basically saving the Redis data to a file/log):
The RDB persistence performs point-in-time snapshots of your dataset at specified intervals.
The AOF persistence logs every write operation received by the server, that will be played again at server startup, reconstructing the original dataset. Commands are logged using the same format as the Redis protocol itself, in an append-only fashion. Redis is able to rewrite the log on background when it gets too big.
If you wish, you can disable persistence completely, if you want your data to just exist as long as the server is running.
It is possible to combine both AOF and RDB in the same instance. Notice that, in this case, when Redis restarts the AOF file will be used to reconstruct the original dataset since it is guaranteed to be the most complete.
This info was quoted from https://redis.io/topics/persistence, which goes into detail about these options.
You can read more from the Antirez weblog: Redis Persistence Demystified

Delphi - Folder Synchronization over network

I have an application that connects to a database and can be used in multi-user mode, whereby multiple computers can connect to the same database server to view and modify data. One of the clients is always designated as the 'Master' client. This master also receives text information from either RS232 or UDP input and logs this data every second to a text file on the local machine.
My issue is that the other clients need to access this data from the Master client. I am just wondering the best and most efficient way to proceed to solve this problem. I am considering two options:
Write a folder synchronize class to synchronize the folder on the remote (Master) computer with the folder on the local (client) computer. This would be a threaded, buffered file copying routine.
Implement a client/server so that the Master computer can serve this data to any client that connects and requests the data. The master would send the file over TCP/UDP to the requesting client.
The solution will have to take the following into account:
a. The log files are being written to every second. It must avoid any potential file locking issues.
b. The copying routine should only copy files that have been modified at a later date than the ones already on the client machine.
c. Be as efficient as possible
d. All machines are on a LAN
e. The synchronization need only be performed, say, every 10 minutes or so.
f. The amount of data is only on the order of ~50MB, but once the initial (first) sync is complete, the amount of data to transfer would only be on the order of ~1MB. This will increase in the future
Which would be the better method to use? What are the pros/cons? I have also seen the Fast File Copy post, which I am considering using.
If you use a database, why does the "master" write data to a text file instead of to the database, if that data needs to be shared?
Why reinvent the wheel? Use rsync instead. Package for Windows: cwRsync.
For example, install the rsync server on the Master machine, and install the rsync client on the client machines (or simply drop the rsync binaries into your project directory). Whenever needed, your application on a client machine can execute rsync.exe, requesting it to synchronize the necessary files from the server.
In order to copy open files you will need to set up the Windows Volume Shadow Copy service. Here's a very detailed description of how the Master machine can be set up to allow copying of open files using Windows Volume Shadow Copy.
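For illustration, this is roughly what the client-side pull could look like. It is sketched in Python only to show the rsync command line; a Delphi client would launch the same command with CreateProcess or ShellExecute. The daemon module name and paths are made up.

    # Sketch: pull only new/changed log files from the Master's rsync daemon.
    # "logs" is a hypothetical rsync module configured on the Master; rsync.exe
    # (cwRsync) must be on the client's PATH. Adjust host and paths to suit.
    import subprocess

    def sync_logs():
        subprocess.run(
            ["rsync",
             "-rtz",                          # recurse, keep mtimes, compress
             "--update",                      # skip files newer on the receiver
             "rsync://master-pc/logs/",       # daemon module on the Master
             "/cygdrive/c/MyApp/logs/"],      # cwRsync-style path for C:\MyApp\logs
            check=True)

    if __name__ == "__main__":
        sync_logs()   # call this from a timer, e.g. every 10 minutes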
Write a web service interface so that the clients can connect to the server and pull new data as needed. Or, you could write it as a subscribe/push mechanism, so that clients connect to the server, "subscribe", and the server then pushes all new content to the registered clients. Clients would need to do a full sync (get all changes since the last sync) when registering, in case they were offline when updates occurred.
Both solutions would work just fine on the LAN, the choice is yours. You might want to also consider those issues related to the technology you choose:
Deployment flexibility. Using file shares and file copy requires file sharing to work, and all LAN users might gain access to the log files.
Longer term plans: File shares are only good on the local network, while IP based solutions work over routed networks, including Internet.
The file-based solution would be significantly easier to implement compared to the IP solution.

offline web application design recommendation

I want to know which architecture is best to adopt for this case:
I have many shops that connect to a web application developed using Ruby on Rails.
The internet is not reachable all the time.
The solution was to develop an offline system, which requires installing a local copy of the remote database.
All of this was already developed.
Now, what I want to do is:
Always work on the local copy of the database.
Any change to the local database should be synchronized with the remote database.
All the local copies should contain the same data as the other local copies.
To solve this problem, I thought about using JMS-like messaging software, possibly RabbitMQ.
This would consist of pushing every SQL statement into a queue; the statements would be executed on the remote instance of the application, which would insert into the remote DB and push the insert/SQL statement onto another queue read by all the local instances. This seems complicated and would likely slow down the application.
Is there a design pattern or recommendation I should apply to solve this kind of problem?
You can do that, but essentially you are developing your own replication engine. Those things can be tricky to get right (what happens if m1 and m3 are executed on replica r1, but m2 isn't?). I wouldn't want to develop something like that unless you are sure you have the resources to make it work.
I would look into an existing off-the-shelf replication solution. If you are already using a SQL DB, it probably has some support for it; see the MySQL replication documentation for more details if you are using MySQL.
Alternatively, if you are willing to explore other backends, I heard that CouchDB has great support for replication. I also heard of people using git libraries to do that sort of thing.
Update: After your comment, I realize you already use MySQL replication and are looking for a solution for re-syncing the databases after being offline.
Even in that case, RabbitMQ doesn't help you at all, since it requires a constant connection to work, so you are back to square one. The easiest solution would be to just write all the changes (SQL commands) into a text file at the remote location, then when you get the connection back, copy that file (scp, ftp, email, or whatever) to the master server, run all the commands there, and then resync all the replicas.
Depending on your specific project, you may also need to make sure there are no conflicts when running commands from different remote locations, but there is no general technical solution to this. Again, depending on the project, you may want to cancel one of the transactions, notify the users that it happened, and so on.
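As a rough sketch of that "log the SQL locally, replay it on the master later" idea (hostname, database name, and file paths are placeholders, and it assumes scp/ssh/mysql are available):

    # Sketch: while offline, append every local write to a statement log; once the
    # connection is back, ship the log to the master and replay it there.
    # Hostname, database name, and paths are placeholders.
    import pathlib
    import subprocess

    STATEMENT_LOG = pathlib.Path("/var/app/offline_statements.sql")  # placeholder
    MASTER = "deploy@master.example.com"                             # placeholder

    def record_statement(sql: str) -> None:
        """Called by the app for every write executed against the local copy."""
        with STATEMENT_LOG.open("a") as f:
            f.write(sql.rstrip().rstrip(";") + ";\n")

    def replay_on_master() -> None:
        """Run once connectivity is restored, then resync the replicas."""
        subprocess.run(["scp", str(STATEMENT_LOG), f"{MASTER}:/tmp/offline.sql"],
                       check=True)
        subprocess.run(["ssh", MASTER, "mysql shop_db < /tmp/offline.sql"],
                       check=True)
        STATEMENT_LOG.unlink()   # start a fresh log for the next offline period

This deliberately ignores the conflict problem mentioned above; resolving conflicting statements from different shops is still up to you.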
I would recommend taking a look at CouchDB. It's a non-SQL database that does exactly what you are describing, automatically. It's used especially in phone applications that often don't have internet or data connectivity. The idea is that you have a local copy of a CouchDB database and one or more remote CouchDB databases. The CouchDB server then takes care of the replication across the distributed systems, and you always work off your local database. This approach is nice because you don't have to build your own distributed replication engine. For more details, take a look at the 'Distributed Updates and Replication' section of their documentation.
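To give an idea of how little is involved, replication is started with a single HTTP call to CouchDB's _replicate endpoint. A hedged sketch using the requests library, with placeholder hostnames and database names:

    # Sketch: ask a local CouchDB node to continuously replicate with a central
    # one, in both directions. URLs and database names are placeholders.
    import requests

    LOCAL = "http://localhost:5984"
    CENTRAL = "http://central.example.com:5984"

    def start_replication(local_db: str, remote_db: str) -> None:
        # Push local changes up to the central server...
        requests.post(f"{LOCAL}/_replicate",
                      json={"source": local_db,
                            "target": f"{CENTRAL}/{remote_db}",
                            "continuous": True}).raise_for_status()
        # ...and pull everyone else's changes back down.
        requests.post(f"{LOCAL}/_replicate",
                      json={"source": f"{CENTRAL}/{remote_db}",
                            "target": local_db,
                            "continuous": True}).raise_for_status()

    start_replication("shop", "shop")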
