I'm working on a big project based on a microservice architecture, so assume I have 10 services, some of which have their own database,
and these databases use different technologies (MySQL, MongoDB, Elasticsearch, ...).
So what is the best practice for backing up and restoring this collection of services?
The real problem is that these databases are related to each other. For example, in my logic backend server I keep the oauthId of each user, which comes from the OAuth server.
Now suppose these two databases are restored separately: the users DB in the logic server may then contain users for which there are no related records on the OAuth server.
Just for your information, I'm using Docker, docker-compose and Docker Swarm for my service orchestration.
As an idea: check how your services depend on each other. If your dependencies are acyclic, you might be able to back up all your data outside-in or inside-out without running into consistency issues.
Doing so would guarantee that, after your restore, no service holds elements that depend on a record missing from an inner service.
If your services have cyclic dependencies, you might be better served by running each service's database redundantly (e.g. master-slave replication). You can then take the slave instances down and back up the whole set of slaves while they are offline, which gives you an atomic backup across all services. The quality of that backup, however, depends on the quality of the master-slave replication at each service.
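A minimal sketch of that offline-slave snapshot, assuming one replica container and one named data volume per store; every container, volume and image name below is made up for illustration:

```bash
#!/usr/bin/env bash
# Hedged sketch: freeze all replica containers, archive their data volumes in
# one pass, then let them catch up with their masters again.
set -euo pipefail

STAMP=$(date +%Y%m%dT%H%M%S)
mkdir -p "$PWD/backups"

# 1. Take every replica offline so no store advances during the snapshot.
docker stop mysql-replica mongo-replica elastic-replica

# 2. Archive each replica's data volume while everything is frozen.
for vol in mysql-replica-data mongo-replica-data elastic-replica-data; do
  docker run --rm -v "$vol":/data -v "$PWD/backups":/backup alpine \
    tar czf "/backup/${vol}-${STAMP}.tar.gz" -C /data .
done

# 3. Bring the replicas back; replication resumes from the masters.
docker start mysql-replica mongo-replica elastic-replica
```

Restoring then means recreating all volumes from archives that carry the same timestamp, so the services come back in a mutually consistent state.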
Lastly, you could keep a record of change per service, plus a full backup. You can then restore the full backup and apply the record of change until you reach a consistent state across the service instances. This requires logical dependencies (e.g. a request identifier) that allow you to correlate the record-of-change entries, i.e. apply them across the services without the risk of violating the logical dependencies that arose when clients actually interacted with your services.
I hope these ideas can help you solve your problem :)
Is it good practice to have multiple services connect to the same database server, but each having its own database?
I guess having one Postgres instance is better than each service/container having its own instance.
My question is: should each service
run in its own container with the DB server instance in the same container,
run in its own container with the database for that service running in a separate container just for the DB (multiple DB containers, one per service), or
connect to one DB SERVER: multiple databases on a single database server, one container for all databases, with all services connecting to this container/DB server?
I understand that each service should have its own DB, but does that also mean they should be completely decoupled even server-wise?
I guess the reason I want to have one DB SERVER is so that resources are not "wasted" by multiple instances of a database server running.
I also understand that having one server will mean that all services are coupled hardware-wise.
It doesn’t really matter. Modern infrastructures tend not to care about the overhead of running multiple copies of the same service. Since database I/O can often be a critical performance point, you might find it more manageable to not share a database, so that you can put databases under heavier load on dedicated and/or larger hardware.
(Also consider running your database(s) on dedicated hardware, not under Docker: they’re the one thing you must back up, and you’ll update them much less often than the rest of your application stack, so their lifecycle is fundamentally different from a disposable Docker container. If you’re using a public cloud service that offers a managed database service and are willing to pay for it, that can also be a very reasonable option.)
Whatever you decide, you almost definitely need to make all of the parameters (host, database, username, password) configurable, usually via environment variables. (I see too many SO questions that have host names hard-coded in source code.) You should be able to deploy the same image with different options in development, test, and production environments, which will generally have different host names.
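For instance, the same image can be pointed at a different database per environment purely through its environment; the image name, variable names and host below are placeholders rather than a prescribed convention:

```bash
# Hypothetical example: one image, configured per environment via env vars.
docker run -d --name orders-service \
  -e DB_HOST=orders-db.staging.internal \
  -e DB_NAME=orders \
  -e DB_USER=orders_app \
  -e DB_PASSWORD="$ORDERS_DB_PASSWORD" \
  myregistry/orders-service:1.4.2
```

The production deployment would run the exact same image with only these values changed.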
Yes, it is OK to have multiple services on the same database server.
The deployment configuration should be guided by your operational needs (cost, performance, monitoring, etc.). As long as the services are decoupled, you have the freedom to move the data around based on those operational needs.
Well, you can consider these two setups:
1) If your services don't care which database server they use (i.e. it is not required that Service A uses MongoDB, Service B uses MS SQL Server, etc.),
then you can go with this setup:
one database server with multiple databases, one database for each service.
2) But if you find that it does matter, as described below:
Service A -> Postgres
Service B -> Postgres
Service C -> MS SQL Server
Service D -> MongoDB
then you will have 3 database servers: Service A and Service B share the same Postgres server (containing 2 databases, one for Service A, one for Service B), Service C has its own MS SQL Server (containing 1 database), and Service D has its own MongoDB server (containing 1 database).
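A rough docker CLI sketch of setup (2); every name, version and credential here is a placeholder:

```bash
# One shared Postgres server with a database per service, plus dedicated
# SQL Server and MongoDB containers. Illustrative only.
docker network create backend

docker run -d --name postgres --network backend \
  -e POSTGRES_PASSWORD=change-me postgres:15
# Crude wait for Postgres to come up before creating the per-service databases.
until docker exec postgres pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done
docker exec postgres psql -U postgres -c 'CREATE DATABASE service_a;'
docker exec postgres psql -U postgres -c 'CREATE DATABASE service_b;'

docker run -d --name mssql --network backend \
  -e ACCEPT_EULA=Y -e MSSQL_SA_PASSWORD='Change-me1!' \
  mcr.microsoft.com/mssql/server:2022-latest    # used by Service C

docker run -d --name mongo --network backend mongo:7    # used by Service D
```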
Usually you will find this case when you are working with different teams (each handling one service) and each team makes its own technology choices.
I am trying to build multiple APIs for which I want to store the data with Cassandra. I am designing it as if I would have multiple hosts, but the hosts I envision would be of two types: trusted and non-trusted.
Because of that, there is certain data which I don't want to end up replicated on one group of the hosts, while the rest of the data should be replicated everywhere.
I considered simply making a node for public data and one for protected data, but that would require the trusted hosts to run two nodes and it would also complicate the way the API interacts with the data.
I am building it in Docker containers as well, and I expect frequent node creation/destruction, both trusted and non-trusted.
I want to know if it is possible to use keyspaces in order to achieve my required replication strategy.
You could have two datacenters, one holding your public data and the other the private data. You can configure keyspace replication to replicate each keyspace to only one (or both) of the DCs. See the docs on replication for NetworkTopologyStrategy.
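For example, assuming the trusted hosts form one datacenter and the non-trusted hosts another (the DC names and replication factors below are placeholders):

```bash
# Public data lives in both DCs; private data only in the trusted DC.
cqlsh -e "
  CREATE KEYSPACE public_data
    WITH replication = {'class': 'NetworkTopologyStrategy', 'trusted_dc': 3, 'public_dc': 3};
  CREATE KEYSPACE private_data
    WITH replication = {'class': 'NetworkTopologyStrategy', 'trusted_dc': 3};
"
```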
However, there are security concerns here, since all the nodes need to be able to reach one another via the gossip protocol, and your client applications might need to contact both DCs for different reads and writes.
I would suggest you look into configuring security, perhaps SSL for starters and then internal authentication. Note that Kerberos is also supported, but that might be more complexity than you need, at least for now.
You may also want to take a look at the firewall docs to see which ports are used between nodes and from clients, so you know which ones to lock down.
Finally, as the other answer mentions, destroying and creating nodes too often is not good practice. Cassandra is designed to grow and shrink the cluster while running, but it can be a costly operation, as it involves not only streaming data from/to the node being removed/added but also other nodes shuffling token ranges around to rebalance.
You can run nodes in Docker containers, but take care not to have several containers all competing for the same physical resources. Cassandra is quite sensitive to I/O latency, for example; several containers sharing the same physical disk might cause performance problems.
In short: no, you can't.
All nodes in a Cassandra cluster form a complete ring, across which your data is distributed by your selected partitioner.
You can have multiple keyspaces plus authentication and authorization within Cassandra, and split your trusted and untrusted data into different keyspaces. Or you can go with two clusters to split your data.
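A hedged sketch of that keyspace-level split using Cassandra's built-in roles; it requires the authenticator/authorizer to be enabled in cassandra.yaml, and all names and passwords here are placeholders. Note this only controls who may access the data, not where it is physically replicated:

```bash
cqlsh -u cassandra -p cassandra -e "
  CREATE ROLE untrusted_api WITH PASSWORD = 'change-me' AND LOGIN = true;
  GRANT SELECT ON KEYSPACE public_data TO untrusted_api;
  GRANT MODIFY ON KEYSPACE public_data TO untrusted_api;
  -- deliberately no grants on the keyspace holding trusted data
"
```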
From my experience, you also should not make creating and destroying Cassandra nodes part of your usual daily business. Adding and removing nodes is costly and needs to be monitored, as your cluster has to maintain replication and so on. So it might be good to separate the Cassandra cluster from your API nodes.
I have a data processing application which is updated on a regular basis. This application has a bunch of dependencies which are also updated every now and then. However, different versions of the software (plus dependencies) might produce different results (this is expected). The application runs on a remote computer and is accessed through a web page. Every time a user uses the web page to do some processing, they also choose which version of the software they want to use.
Now I am trying to decide which is the best way of keeping track different software (+dependencies) versions. The simplest way of course is to just compile and install each version of my software and its dependencies in a different folder, and then based on the request the user sends, the appropriate folder is selected. However, this sounds very clunky to me. So I thought I could use Docker to keep track of the different software versions. Do you think that it is a good idea? If yes, what is most appropriate to do every time I have a new version of the software (and/or dependencies): 1) Create a new container from scratch with the new version (and end up having multiple containers), or 2) Update the existing container and commit the changes? (I suppose I can access the older commits of the container, right?)
PS: Keep in mind that the reason I looked into Docker and not a simple virtual machine solution is that the application I am running is a high-performance GPU-based software.
Docker is a reasonable choice. Your repository would contain all of the app versions you wish to publish. Note, you will only realize savings if you organize the resulting app filesystem into layers, of which the lower layers are the least likely to change between versions. This will keep the storage requirements at a minimum.
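For instance, if each release is built from the same Dockerfile with the dependencies installed in the early layers and only the application code added last, all version tags share the heavy lower layers. The image name and the APP_VERSION build argument below are assumptions for illustration:

```bash
# Hypothetical builds of two releases from one Dockerfile whose final layer
# is the only one that differs between versions.
docker build -t dataproc:2.3.0 --build-arg APP_VERSION=2.3.0 .
docker build -t dataproc:2.3.1 --build-arg APP_VERSION=2.3.1 .

# Both tags reuse the cached dependency layers; only the app layer is stored twice.
docker image ls dataproc
```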
Then you have to decide how you will process each job. A robust (but complex) solution would be to have one or more API containers which take in processing jobs from your user and "dole" them out to worker containers (one or more from each release version). This would provide the lowest response latency and be non-blocking. You can look at different service discovery models to see how your "worker" containers can register with your "manager" containers. This is probably more than you'd like to bite off, but consider using a good key-value database (another container!) like etcd or a 3rd party service discovery tool like zookeeper/eureka/consul.
A much simpler model would have a single API container with one each of the release containers created, but not started. The API container would start, direct, and then stop the appropriate release container. You would incur the startup latency, but this is the least resource intensive... and easiest to manage. But this is a blocking operation.
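A rough sketch of that blocking model, assuming one pre-created container per release; the container names, the job command and its arguments are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical dispatcher: start the requested release container, run the job
# inside it, then stop the container again. Assumes each release container
# keeps a long-running entrypoint while started.
set -euo pipefail

VERSION="$1"     # version chosen by the user on the web page, e.g. "2.3.1"
INPUT_PATH="$2"  # data the user wants processed

docker start "dataproc-${VERSION}"
docker exec "dataproc-${VERSION}" /opt/app/process "${INPUT_PATH}"
docker stop "dataproc-${VERSION}"
```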
Somewhere in the middle, but less user friendly, is to have each release container running but listening on a different host port (the app always sees the same port). The user would connect to the port which is serving the desired release of the app. You'd have to provide some sort of index to make this useful.
We use MongoDB as the central database for our application, a consumer-facing mobile app. At present it is a 7-member replica set, with replica-set-1 being the primary at the moment. The backend which connects to the replica set is built in Ruby on Rails and we use Mongoid as the ODM.
There are mainly 3 pieces connecting to the MongoDB replica set:
The consumer application
The Admin and customer care management application
The Data retrieval application ( for analytics and such purposes )
All these 3 apps connect to the same replica set as of now.
What I would like to know is whether it is possible to connect different applications to specific replicas.
For example, the mobile app connects to the primary for writes and to replicas 2-4 for reads, while the customer care management application connects to the primary (for writes) and to replicas 5-7 for reads.
Explicitly listing specific replicas in the mongoid.yml configuration doesn't seem to work. Even though I have mentioned only replica-set-7 in the Mongoid hosts list for the data retrieval application, I still see some of its queries in the log files of replica-set-2 and replica-set-3.
So apparently MongoDB decides how to distribute the queries among its replicas regardless of the configuration specified at the Mongoid client end.
I would really love to know if such a thing is possible at all using MongoDB and Mongoid, as it would help us solve a lot of our load issues. Right now, heavy queries from the customer care and data retrieval apps affect the consumer-facing mobile app as well, because the reads are not segregated. So basically I would like to separate out the reads.
Also, if this is possible at all, I would still be wary of potential pitfalls, especially since all 3 applications can write to the DB. For example, replica-3 could suddenly become the primary after an election while not being explicitly mentioned in the configuration of the data retrieval application; what would happen then is a concern.
I am not at all sure whether this is possible, but I just wanted to know if there is a way to achieve it. Any help would be really appreciated.
When you connect to any member of a replica set, the client is told the full state of the replica set and can connect to any of them. The initial set of hosts are just the seeds for that process - as long as your application can reach one of those hosts, it doesn't matter which hosts are in that configuration.
Mongo does have the concept of tagged replica set members. When creating a connection or executing a query, you can specify the tags used to select which replica set members to read from.
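As a hedged sketch (the host names, the workload tag, member indexes and database are all placeholders, and the modern mongosh shell is assumed): tag the members once via the shell, then have each application ask for secondaries carrying its tag. In recent Mongoid versions the client-side part maps to something like read: { mode: :secondary, tag_sets: [...] } in mongoid.yml.

```bash
# Tag two members of the replica set (run against the current primary).
mongosh "mongodb://replica-set-1:27017" --eval '
  cfg = rs.conf();
  cfg.members[4].tags = { workload: "customer_care" };  // e.g. replica-set-5
  cfg.members[6].tags = { workload: "reporting" };      // e.g. replica-set-7
  rs.reconfig(cfg);
'

# A reporting client can then restrict its reads to secondaries with that tag:
mongosh "mongodb://replica-set-1:27017/app_db?replicaSet=rs0" --eval '
  db.users.find().readPref("secondary", [{ workload: "reporting" }]);
'
```

Writes still always go to the primary, whichever member that happens to be after an election.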
I want to know which is the best architecture to adopt for this case :
I have many shops that connect to a web application developed using Ruby on Rails.
The internet is not reachable all the time.
The solution was to develop an offline mode, which requires installing a local copy of the remote database.
All this was already developed.
Now what I want to do:
Always work on the local copy of the database.
Any change on the local database should be synchronized with the remote database.
All the local copies should have the same data as the other local copies.
To solve this problem I thought about using JMS-like software, possibly RabbitMQ.
This consists of pushing every SQL request into a queue that is consumed by the remote instance of the application, which applies it to the remote DB and pushes the insert or SQL statement into another queue that is read by all the local instances. This seems complicated and would probably slow down the application.
Is there a design or recommendation that I should apply to solve this kind of problem?
You can do that, but essentially you are developing your own replication engine. Those things can be tricky to get right (what happens if m1 and m3 are executed on replica r1, but m2 isn't?). I wouldn't want to develop something like that unless you are sure you have the resources to make it work.
I would look into an existing off-the-shelf replication solution. If you are already using an SQL DB, it probably has some support for this. Look here for more details if you are using MySQL.
Alternatively, if you are willing to explore other backends, I have heard that CouchDB has great support for replication. I have also heard of people using Git libraries to do that sort of thing.
Update: after your comment, I realize you already use MySQL replication and are looking for a way to re-sync the databases after being offline.
Even in that case RabbitMQ doesn't help you at all, since it requires a constant connection to work, so you are back to square one. The easiest solution would be to just write all the changes (SQL commands) into a text file at the remote location; when you get the connection back, copy that file (scp, ftp, email or whatever) to the master server, run all the commands there, and then resync all the replicas.
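A minimal sketch of that file-based queue, with all paths, hosts, credentials and the example statement invented for illustration:

```bash
# While offline: append every change the shop makes to a spool file.
echo "INSERT INTO orders (shop_id, total) VALUES (42, 99.90);" >> /var/spool/shop/pending.sql

# When connectivity returns: ship the file to the master and replay it there.
scp /var/spool/shop/pending.sql sync@central.example.com:/tmp/pending-shop42.sql
ssh sync@central.example.com "mysql shop_db < /tmp/pending-shop42.sql"

# Afterwards, truncate the spool and let normal MySQL replication
# propagate the changes back out to the other shops.
: > /var/spool/shop/pending.sql
```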
Depending on your specific project, you may also need to make sure there are no conflicts when running commands coming from different remote locations, but there is no general technical solution to this. Again, depending on the project, you may want to cancel one of the transactions, notify the users that it happened, and so on.
I would recommend taking a look at CouchDB. It's a non-SQL database that does exactly what you are describing, automatically. It is used especially in phone applications that often don't have internet or data connectivity. The idea is that you have a local copy of a CouchDB database and one or more remote CouchDB databases. The CouchDB server then takes care of the replication between the distributed systems, and you always work off your local database. This approach is nice because you don't have to build your own distributed replication engine. For more details I would take a look at the 'Distributed Updates and Replication' section of their documentation.
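For a flavour of how little is involved (URLs and database names below are placeholders), a local CouchDB can be told to replicate continuously with a central one through its HTTP API:

```bash
# Ask the local CouchDB to keep pushing local changes to the central server.
curl -X POST http://localhost:5984/_replicate \
  -H 'Content-Type: application/json' \
  -d '{"source": "shop_local", "target": "https://central.example.com:6984/shop_central", "continuous": true}'
```

The reverse direction (central to local) is set up with a second call that swaps source and target.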