I am currently setting up a master-slave app using Ruby on Rails. I am planning to use data-fabric or octopus gem for handling the read/write connections.
This is my first time setting up master-slave DBs. I am confused over the various open source tools available to implement the postgresql replication e.g. pgpool II, pgcluster, Bucardo and Hot Standby/Streaming Replication (built in feature in postgresql 9.1)
My requirements are
fault tolerance(high availability and no data loss on failover)
load balancing
Thanks in advance
Note: I have gone through the stackoverflow post regarding postgresql replication but they are pretty old and not helping to conclude on which tool I should go with.
In your case, streaming replication is the place to start. It is not very flexible but it does what you need regarding database reads as long as you don't need to replicate between major versions.
Database Replication 101
Database replication is a way to ensure that data saved to a specific server becomes stored in a number of other servers. This is often done to better utilize more limited network connections, ensure fault tolerance (so there is essentially a hot back-up), ensure that read-only queries can be distributed over a larger number of databases, etc. This all must be done without sacrificing the the basic guarantees of ACID.
There are a number of different overlapping ways to categorize replication solutions. These include:
Page or file-level vs row-level vs statement-level
Synchronous vs Asynchronous
Master-slave vs Multi-Master
In general understanding replication and the tradeoffs between solutions requires relatively strong understanding of database mechanics and ACID guarantees. I will assume you are relatively familiar with storage mechanics, and deterministic vs non-deterministic operations and the like.
What is Being Replicated? File changes (Physical) vs Row Changes (Logical) vs Statements
The simplest approach is to replicate block changes to files, for example as stored in the write-ahead log in PostgreSQL. This replicates changes at the page level and it requires identical file formats. This means you cannot replicate across major versions, CPU architectures, or operating systems. Anything that could affect the alignment of tuples, for example, will cause the replication to either fail or, worse, corrupt the slave's database. This is the approach streaming replication uses. It is simple to set up, and it always replicates everything in the database cluster.
Additionally this approach means you can easily guarantee that the master and slave databases are identical down to the file level. Because of the fact that the PostgreSQL WAL is cluster-global it is unlikely that this approach will ever replicate anything short of the entire database cluster.
As a description of how this works, suppose I:
UPDATE my_table SET rand_value = random() WHERE id > 10000;
In this case, this changes a bunch of data pages and the file operations are replicated to the replicas. The files remain identical between the master and slave.
Another approach, one taken by Slony, Bucardo, and others is to replicate rows in a logical manner. In this approach, changed rows are flagged and logged, and the changes sent to the replicas. The replicas re-run row operations from the master database. Because these are add-on tools which do not replicate file operations but rather logical database operations, they can replicate across CPU architectures, operating systems, etc. Also they are usually designed so that you can replicate some but not all tables in a database, allowing for a lot of flexibility. On the other hand this leads to a lot of potential for errors. "Oops, that table was not replicated" is a real problem.
In this case when I run the update statement above, a trigger is fired capturing the actual rows inserted and deleted and these are logged, replicated, and the row operations re-run. Because this happens after rand() is run, the databases are logically, but not necessarily physically identical.
A final approach is statement replication. In this case we replicate statements and re-run the statements on the replicas. Some configurations of PgPool will do this. In this case, you cannot ensure that a database is logically equivalent to its replica if any non-deterministic functions are run. In the statement above, the statement itself will run on each replica, ensuring different pseudorandom numbers in the relevant column.
Synchronous vs Asynchronous
This distinction is important to understand regarding failover guarantees. In an asynchronous replication system, the updates are queued and transferred when possible to the replicas and re-run there. In a synchronous replication system the database which accepts the write will not return a successful commit until at least a certain number of replica databases report a successful commit.
Asynchronous replication is generally more robust and produces better availability than synchronous replication. This is because synchronous replication introduces additional points of failure. If you have one master and one slave, then if either system goes down, your database becomes unavailable at least for write operations.
The tradeoff though is that synchronous replication offers a guarantee that data which is committed is in fact available on replicas in the event that the master, say, suffers catastrophic hardware failure immediately following commit. This is a very low probability event, but in some cases it is important that you know the data is still available. In short this provides additional durability guarantees not present in async replication.
Multi-Master vs Master-Slave
Most replication systems are master-slave. In this case, all writes begin at one node and are replicated to other nodes. Writes may only begin at one node. They may not begin at other nodes. This makes replication straight-forward because we know that the slaves represent a past state of the master.
Multi-master replication allows writes to occur to more than one node. In an asynchronous replication system, this leads to the problem of conflict resolution. These problems are actually worse than most assume when you add DDL statements. Suppose two different users run the above update statement on two different masters. We will now have a set of records that have to be replicated across but they will conflict.
Multi-master replication typically requires that people think through this conflict resolution process quite carefully. It is never a process that just works out of the box. Often times you write your own conflict resolution routines. For this reason I typically recommend avoiding multi-master replication unless you really need it.
Related
I have an existing system that uses a relational DBMS. I am unable to use a NoSQL database for various internal reasons.
The system is to get some microservices that will be deployed using Kubernetes and Docker with the intention to do rolling upgrades to reduce downtime. The back end data layer will use the existing relational DBMS. The micro services will follow good practice and "own" their data store on the DBMS. The one big issue with this seems to be how to deal with managing the structure of the database across this. I have done my research:
https://blog.philipphauer.de/databases-challenge-continuous-delivery/
http://www.grahambrooks.com/continuous%20delivery/continuous%20deployment/zero%20down-time/2013/08/29/zero-down-time-relational-databases.html
http://blog.dixo.net/2015/02/blue-turquoise-green-deployment/
https://spring.io/blog/2016/05/31/zero-downtime-deployment-with-a-database
https://www.rainforestqa.com/blog/2014-06-27-zero-downtime-database-migrations/
All of the discussions seem to stop around the point of adding/removing columns and data migration. There is no discussion of how to manage stored procedures, views, triggers etc.
The application is written in .NET Full and .NET Core with Entity Framework as the ORM.
Has anyone got any insights on how to do continious delivery using a relational DBMS where it is a full production system? Is it back to the drawing board here? In as much that using a relational DBMS is "too hard" for rolling updates?
PS. Even though this is a continious delivery problem I have also tagged with Kubernetes and Docker as that will be the underlying tech in use for the orchestration/container side of things.
All of the following under the assumption that I understand correctly what you mean by "rolling updates" and what its consequences are.
It has very little (as in : nothing at all) to do with "relational DBMS". Flatfiles holding XML will make you face the exact same problem. Your "rolling update" will inevitably cause (hopefully brief) periods of time during which your server-side components (e.g. the db) must interact with "version 0" as well as with "version -1" of (the client-side components of) your system.
Here "compatibility theory" (*) steps in. A "working system" is a system in which the set of offered services is a superset (perhaps a proper superset) of the set of required services. So backward compatibility is guaranteed if "services offered" is never ever reduced and "services required" is never extended. However, the latter is typically what always happens when the current "version 0" is moved to "-1" and a new "current version 0" is added to the mix. So the conclusion is that "rolling updates" are theoretically doable as long as the "services" offered on server side are only ever extended, and always in such a way as to be, and always remain, a superset of the services required on (any version currently in use on) the client side.
"Services" here is to be interpreted as something very abstract. It might refer to a guarantee to the effect that, say, if column X in this row of this table has value Y then I will find another row in that other table using a key computed such-and-so, and that other row might be guaranteed to have column values satisfying this-or-that condition.
If that "guarantee" is introduced as an expectation (i.e. requirement) on (new version of) client side, you must do something on server side to comply. If that "guarantee" is currently offered but a contradicting guarantee is introduced as an expectation on (new version of) client side, then your rolling update scenario has by definition become inachievable.
(*) http://davidbau.com/archives/2003/12/01/theory_of_compatibility_part_1.html
There are also parts 2 and 3.
I work in an environment that achieves continuous delivery. We use MySQL.
We apply schema changes with minimal interruption by using pt-online-schema-change. One could also use gh-ost.
Adding a column can be done at any time if the application code can work with the extra column in place. For example, it's a good rule to avoid implicit columns like SELECT * or INSERT with no columns-list clause. Dropping a column can be done after the app code no longer references that column. Renaming a column is trickier to do without coordinating an app release, and in this case you may have to do two schema changes, one to add the new column and a later one to drop the old column after the app is known not to reference the old column.
We do upgrades and maintenance on database servers by using redundancy. Every database master has a replica, and the two instances are configured in master-master (circular) replication. So one is active and the other is passive. Applications are allowed to connect only to the active instance. The passive instance can be restarted, upgraded, etc.
We can switch the active instance in under 1 second by changing an internal CNAME, and updating the read_only option in each MySQL instance.
Database connections are terminated during this switch. Apps are required to detect a dropped connection and reconnect to the CNAME. This way the app is always connected to the active MySQL instance, freeing the passive instance for maintenance.
MySQL replication is asynchronous, so an instance can be brought down and back up, and it can resume replicating changes and generally catches up quickly. As long as its master keeps the binary logs needed. If the replica is down for longer than the binary log expiration, then it loses its place and must be reinitialized from a backup of the active instance.
Re comments:
how is the data access code versioned? ie v1 of app talking to v2 of DB?
That's up to each app developer team. I believe most are doing continual releases, not versions.
How are SP's, UDF's, Triggers etc dealt with?
No app is using any of those.
Stored routines in MySQL are really more of a liability than a feature. No support for packages or libraries of routines, no compiler, no debugger, bad scalability, and the SP language is unfamiliar and poorly documented. I don't recommend using stored routines in MySQL, even though it's common in Oracle/Microsoft database development practices.
Triggers are not allowed in our environment, because pt-online-schema-change needs to create its own triggers.
MySQL UDFs are compiled C/C++ code that has to be installed on the database server as a shared library. I have never heard of any company who used UDFs in production with MySQL. There is too a high risk that a bug in your C code could crash the whole MySQL server process. In our environment, app developers are not allowed access to the database servers for SOX compliance reasons, so they wouldn't be able to install UDFs anyway.
I am planning to configure some sort of 2 node replication for neo4j, similar to mysql replication. Since I am a little constrained on resources I don't want to pay for more than two Cloud compute instances. Also I am happy with just one real time or near real time copy of the neo4j database. So the approach i can think of is:
Configure HA on the two compute nodes with the help of an arbiter instance. Setup one neo4j instance (master) on first node and another neo4j instance (slave) + another neo4j instance (arbiter, only for arbitration, no data logging) instance on second node.
OR
Setup a cron for online backup using the neo4j-backup tool. Setup incremental backups every hour or so. Not sure the load it may put on the prod server, planning to test that out.
I am more inclined on the first approach since I get a more real time copy the database (I also get HA/load balancing with instant failover but that is not a priority right now).
Please let me know
which of the two approach is better,
if there is another way to achieve the same or
if any of the above approaches are not suitable or have some flaws.
I am a little new to Neo4j HA so please pardon me for my ignorance. Thanks !
So. You already mentioned available solutions.
TL;DR; I prefer first option.
Cluster
In general, recommended layout is 3 nodes (2 slaves + 1 master).
But your layout - 2 nodes (1 master + 1 slave + 1 arbiter) is viable too. Especially if one server can handle your workload.
Good things:
Almost "real-time" replica.
Possibility to utilise resources to handle bigger workload.
Better availability.
Notes:
If you have 10mb/sec write load on master, then same load will be applied on slave node. This shouldn't affect reads from slave at all (except write load is REALLY huge).
Maintenance costs are bigger, then single-instance installation. You should plan how to handle cluster upgrades, configuration updates, plugin updates.
Branched data. In clustered environment there is possibility to end up in "split-brain" scenario, when 2 nodes have different data and decision should be made which data should be kept. Neo4j handles such cases quite good. But you should keep in mind that small data-loss can occur in VERY RARE scenarios.
Backup
Good things:
Simple. Just do backups from database.
Consistency check. When backup is made, tool runs consistency check to verify if database is not damaged. There is no possibility that Backup will screw up live database. If there any issues - you will be notified via logs from backup utility. See below detailed info on to how backup is performed.
Database. Neo4j backup is fully-functional database. You can spin-up server that points to backup database, and do everything you wan't.
Incremental backups. You can do incremental backups as often, as you wan't.
Notes:
Neo4j scales vertically very well (depends on size of database). It can handle huge load on single instance (we had up to 3k requests/second on medium machine). So, you can get one bigger machine for Neo4j server and other smaller (cheaper) for backups.
How backup is performed?
One thing that should be kept in mind - live database is still fully operational. Backup utility doesn't not stop or prevent any actions.
When transaction in database is committed, all changes are appended to transaction log.
When there are no previous backup present: copy whole storage.
When there is previous backup AND transaction logs are available: copy new transaction logs and replay them on to storage.
When there is previous backup AND transactions are NOT available: discard existing storage, copy existing storage.
Why transaction logs can not be available? Your configuration may say to keep only latest transaction logs (i.e. 1 hour), or not to keep at all.
Relevant settings:
keep_logical_logs
logical_log_rotation_threshold
Other
Anyway, you should consider making backups event in clustered environment. Everything can fail, in any moment.
In general - everything depends on your load and database size.
If your database is small enough to fully fit in memory and one machine is enough to handle all load, then one Neo4j instance will be enough. Just do backup.
If you wan't better scalability/availability and real-time working replica, then cluster setup is best choice.
I have an erlang application currently running on four nodes with a replicated mnesia db that stores minimal data regarding connected clients. The mnesia replication has been working seamlessly in the past (as far as I know anyway) but a client recently noticed that one of the nodes is missing some ids related to his application.
I'm not really sure how this happened. Our network may have had a hiccup at the time. Maybe? But, of more urgency at the moment is getting the data into a good state across all nodes. Is there a way to tell mnesia to replicate from a known-good node?
Mnesia is legendary about this issue. It's a huge PITA.
Looking at it from CAP theorem's point of view, most systems built with Mnesia end up being C-A (consistency-availability with no partition tolerance) systems. For most of the time you have (and heavily rely on) its hard consistency. Then a network partition happens...
It's still available for writes, but these writes destroy consistency. And later on, Mnesia has no mechanism for automatic data repair.
Everyone who uses Mnesia in a cluster should familiarize themselves with these tradeoffs. Your problem is a clear sign that using Mnesia was a poor choice. Double so if this data is critical to you.
I too use Mnesia in such a way (sometimes we all need speed you know). But I make sure to only use it to store data that I can easily reconstruct. In general, if you need it stored on disk, Mnesia is no good, except for toy projects.
I make sure to always have this function at hand:
reinit_mnesia_cluster() ->
rpc:multicall(mnesia, stop, []),
AllNodes = [node() | nodes()],
mnesia:delete_schema(AllNodes),
mnesia:create_schema(AllNodes),
rpc:multicall(mnesia, start, []).
Use it only after the network partition has been resolved and all nodes are reachable. This will erase all Mnesia replicas and start it anew. Again, if you can't live with what it does, then using Mnesia was a poor choice.
For important data that needs hard consistency, use SQL. For important data that needs availability, use Riak. For shared state that needs speed, use Redis. Mnesia is no replacement for these systems, although at first it does seem so.
Edit on 2014-11-16: Here is a much better article on the topic, explaining in detail what I said above https://medium.com/#jlouis666/mnesia-and-cap-d2673a92850
Honestly, I think the cleanest way to get an out-of-sync Mnesia to replicate from a known good node is to shut down the application on the bad node, and delete all its Mnesia database files, then do the following.
Write an escript that starts Mnesia up standalone using the "bad" node name and Mnesia directory, replicates the tables from a known good node, and shuts Mnesia down. Run that escript on the bad node.
The act of replicating the tables and shutting Mnesia down gracefully puts the node back in sync with the cluster. Then, when you start the application up on the bad node, it will join up and stay in sync with the cluster.
Of course, this description lacks precise details, but that's the gist of it. There are surely less brute force ways of doing this, but unless you have massive amounts of data to replicate, I think this way is the quickest and cleanest.
At my new company, they keep all data associated with the data warehouse, including import, staging, audit, dimension and fact tables, together in the same physical database.
I've been a database developer for a number of years now and this consolidation of function and form seems counter to everything I know.
It seems to make security, backup/restore and performance management issues more manually intensive.
Is this something that is done in the industry? Are there substantial reasons for doing or not doing it?
The platform is Netezza. The size is in terabytes, hundreds of millions of rows.
What I'm looking to get from answers to this question is a solid understanding of how right or wrong this path is. From your experience, what are the issues I should be focused on arguing if this is a path that will cause trouble for us down the road. If it is no big deal, then I'd like to know that as well.
In general I would recommend using separate databases. This is the configuration I have always seen used in production and it really makes a lot of sense since - as you mentioned - both databases have fundamentally different purposes / usage patterns / etc.
Edit
If you're using one physical server, the fewer instances on that server the simpler the management and the more efficient the process.
If you put TWO instances on the same Physical Server you get:
Negatives:
Half the memory to use
Twice the count of database process
Positives:
You could take the entire staging db down without affecting the DW
So which is more precious to you, outage windows or CPU and Memory?
On the same the physical server multiple instances make performance management issues MUCH more manual to solve. If you look at the health of one of the instances, it might look fine but users are reporting poor performance, so you have to look at the next instance to see if the problem may be coming from there... and so on per instance.
Security is also harder with more than one instance. At best it's just as hard as a single instance but it's never easier. You'll have two admin accounts (SYS or something), Duplicate process accounts, etc.
Tell us why you think it's better to have more than one instance.
ORIGINAL POST
Can we be clear on terms. When you say "in the same Database" do you mean to say the same instance, or the same physical server. If you did move the staging to a new instance would it reside on the same physical hardware?
I think people get a little too hung up on instances. If you're going to put two instances on the same piece of hardware, you're only doubling the number of everything to very little advantage. All the server processes will be running twice... all the memory pools will be cut in half.
so let's say you really did mean two separate physical boxes...
Let's say you buy 2 12-way boxes (just say). When you're staging db server is done for the day, those 12 CPU's are wasting away. When your users pack up and go home, your prod DW CPUs are wasting away. CPU cycles are perishable, you can't get them back. BUT, if you had one 24 way box... then the staging DB COULD use 20 CPUs at night for some excellent Parallel Execution for building summary tables and your users will have double the capacity for processes during the day.
so let's say you meant the same hardware.
"It seems to make security, backup/restore and performance management issues more manually intensive."
Guaranteed that performance issues are harder to solve the more instances that share the same hardware. Guaranteed.
Security
What security do you do at the instance level?
Backup
What DW are you backing up at the instance level? You're not backing up tablespaces, but rather whole instances? Seems like that pattern will fail at a certain size.
PLATFORM: NETEZZA
Not familiar with the tool specifically. So if it's a single instance on a single box, then the division would seem more logical than physical and therefore the reasons they exist is for management, not performance. You don't increase your CPUs or memory by adding a database, right? So it doesn't seem like there's no performance upside to it. Each DB may be adding separate processes (performance hit), or it might be completely logical like schemas in Oracle. If each database is managed by new processes than data going between them will mean IPC.
Maybe the addition of the Netezza tag will get some traction.
We use databases for every segment (INVENTORY, CRM, BILLING...). There are no performance downsides and maintenance and overview is much better.
Better late than never, but for Netezza:
There are no performance hits while querying cross database. Netezza allows only SELECT operations cross database, no INSERT, UPDATE or DELETEstatements allowed.
This means you cannot do:
THISDB(ADMIN)=>INSERT INTO OTHERDB..TBL SELECT * FROM THISDBTABLE;
but you can do \c OTHERDB then
OTHERDB(ADMIN)=>INSERT INTO TBL SELECT * FROM THISDB..THISDBTABLE;
You are also not able to create a materialized view on a cross-database object, for example:
OTHERDB(ADMIN)=>CREATE MATERIALIZED VIEW BLAH AS SELECT * FROM THISDB..THISDBTABLE;
Administration might be where you will decide (though you probably already did long ago) on what kind of database(s) you'll create. Depending on your infrastructure, you might have a TEST/QA system and a PROD system on the same box, or on separate boxes.
You will gain speed in the load and the output if the tables are in the same schema (database). Obvious...but hey, I said it.
There is more overhead the more tables you put into one schema. Backups time, size of backups, ease of use.
Where I am, we have many multiple TB databases within one data-warehouse. Our rule of thumb is that a single loading process or a single report query should NOT have to span database. This keeps "like" tables together but gives some allowances for our backups and contingency processes. It also makes it a bit easier to "find" data.
For those processes that need to break this rule, we will either move data from one database to the other or allow the process to join across schemas.
I'm not as familiar with Netezza, so I'm not 100% sure what your options might be.
Few points for you to consider
a) If the data in one or more staging, audit, dimension and fact table has to be joined, you are better off keeping them in one database
b) Typically you will retain dimension tables and fact tables in the same database and distribute on most frequently joined columns to leverage "co-located join" functionality of Netezza
c) You should be able to use SQL grant permission to manage access to all objects (DB, tables, views etc)
When you hit a roof on reading from a database, you have two choices, scale vertically by putting more hardware in the server, or scale horizontally by putting a second server to help offload the reads.
Offloading reads to a second server, means that all writes will hit both servers, while read only hits one.
Problem is when you hit a roof with writing, since writing has to happen to all servers, it means that all servers will be overloaded with write requests, and the server comes unusable. Adding more servers to the problem doesn't help, since it only adds more servers that will be overloaded. So you have to scale vertically.
Is this something that is specific to RDBMS'? or is it something that happens with all DBMS'?
I know you can do things on software side, and split the database in two, eg. all entries starting with 0-m in one db while n-z in another, but IMHO it is more of a workaround than a solution to the problem.
I can't see that this would be specific to the relational model. All databases that have to read and write (and that's most of them) will have a similar problem.
For what it's worth, most databases are read far more than written so the write roof occurs less frequently than you might think. In addition, load balancing databases as per your method tends to be an immediate write to the primary with queued writes to all secondaries (at least in my experience).
In that case, you're not actually waiting around for multiple writes as a user, you just wait for the first. The DBMS itself manages the synchronisation between instances. This of course means that secondary databases might not be totally up-to-date but this can be controlled. Technically, this breaks the ACID properties of the system as a whole but this can be architected around.
I think this is the case with any DBMS, although some handle it better than others. Like you mention, partitioning the database in software seems to be the most common solution to this.
In many applications though, partitioning the database like that makes sense anyways if you are at such a huge scale that it becomes necessary. For example, if you had a social networking app, it would probably make sense to partition your database by country or other geographical regions. This would allow you to have your servers located geographically close to the regions they serve. It would also help mitigate any problems with a cross-database "social graph" since peoples friends tend to live nearby.
You're hardly going to "hit a roof with writing, since writing has to happen to all server" because in most of RDBMS installations:
1) Reads are overwhelming more frequent than writes
2) Modern RDBMs have Multi-Version Concurrency Control able to reduce blocking when reading/writing