Scaling HBase writes on a cluster using Thrift - scalability

We're trying to scale up HBase writes on a cluster using Thrift. (Our HBase application is in Python, and hence needs Thrift.)
Despite increasing the number of nodes in the cluster, we are seeing the same write speeds.
First off, is the recommended strategy to run Thrift on:
1. The client?
2. The HBase master?
3. HBase region servers?
If on #1 or #2, will the client or HBase master take care of splitting the requests to the various region servers? It doesn't appear to in our case.
If #3, then I have to modify the client to write to the specific region servers, and randomize the writes. I can do this, but it seems to defeat the purpose of using HBase.
Any other tips on read/write scaling (especially with Thrift) are greatly appreciated.

In HBase to gain performance with node increase you should have a decent "rowkey" distribution. As long as you have "hot spots" (a very busy region server) in your cluster you would not gain anything from increasing your cluster size. checkout the article on row key design to start with.
If you don't need to read right away (if you are comfortable with async writes) you can check asynch hbase client from stumbleupon for performance gain.

I found the answer at these two questions, it looks like we'll go with #3 (write to the specific region servers, and randomize the writes):
Is it better to send data to hbase via one stream or via several servers concurrently?
HBase Thrift: how to connect to remote HBase master/cluster?

Related

Database synchronization time in cassandra

I have two controllers accessing distributed database. I am receiving some data from a device to the controllers and i store them in Cassandra database. I use Docker to install cassandra
The node 1 is on controller 1 and node 2 is on controller 2. I would like to know if there is a possibility to measure the time it takes to update the node 2, when i receive data at node 1.
I would like to draw a graph with it. So could someone tell me how do i measure it.
Thanks
Cassandra provides tools and insights of all this internal information with the nodetool gossipinfo command and cqlsh tracing.
In the scenario that you are proposing, I'm inferring that you are using a Replication Factor of 2, and that you are interested in the exact time that is taking to have the information written in all the nodes, you can measure the time required to do a write with the consistency level set to ALL, and compare it with similar writes using the consistency level of ONE. The difference of the times will be the propagation from one node to the other.
Finally, if you are interested in measuring the performance of the queries in Cassandra, there are several tools that enhance the tracing functionality, in our team we have been using zipkin with good results.

Bosun HA and scalability

I have a minor bosun setup, and its collecting metrics from numerous services, and we are planning to scale these services on the cloud.
This will mean more data coming into bosun and hence, the load/efficiency/scale of bosun is affected.
I am afraid of losing data, due to network overhead, and in case of failures.
I am looking for any performance benchmark reports for bosun, or any inputs on benchmarking/testing bosun for scale and HA.
Also, any inputs on good practices to be followed to scale bosun will be helpful.
My current thinking is to run numerous bosun binaries as a cluster, backed by a distributed opentsdb setup.
Also, I am thinking is it worthwhile to run some bosun executors as plain 'collectors' of scollector data (with bosun -n command), and some to just calculate the alerts.
The problem with this approach is it that same alerts might be triggered from multiple bosun instances (running without option -n). Is there a better way to de-duplicate the alerts?
The current best practices are:
Use https://godoc.org/bosun.org/cmd/tsdbrelay to forward metrics to opentsdb. This gets the bosun binary out of the "critical path". It should also forward the metrics to bosun for indexing, and can duplicate the metric stream to multiple data centers for DR/Backups.
Make sure your hadoop/opentsdb cluster has at least 5 nodes. You can't do live maintenance on a 3 node cluster, and hadoop usually runs on a dozen or more nodes. We use Cloudera Manager to manage the hadoop cluster, and others have recommended Apache Ambari.
Use a load balancer like HAProxy to split the /api/put write traffic across multiple instances of tsdbrelay in an active/passive mode. We run one instance on each node (with tsdbrelay forwarding to the local opentsdb instance) and direct all write traffic at a primary write node (with multiple secondary/backup nodes).
Split the /api/query traffic across the remaining nodes pointed directly at opentsdb (no need to go thru the relay) in an active/active mode (aka round robin or hash based routing). This improves query performance by balancing them across the non-write nodes.
We only run a single bosun instance in each datacenter, with the DR site using the read only flag (any failover would be manual). It really isn't designed for HA yet, but in the future may allow two nodes to share a redis instance and allow active/active or active/passive HA.
By using tsdbrelay to duplicate the metric streams you don't have to deal with opentsdb/hbase replication and instead can setup multiple isolated monitoring systems in each datacenter and duplicate the metrics to whichever sites are appropriate. We have a primary and a DR site, and choose to duplicate all metrics to both data centers. I actually use the DR site daily for Grafana queries since it is closer to where I live.
You can find more details about production setups at http://bosun.org/resources including copies of all of the haproxy/tsdbrelay/etc configuration files we use at Stack Overflow.

Neo4j - Difference between High Availability and Distributed Mechanism?

Neo4j explain about the clustering through a concept called High Availability. And, What I know about clustering with hadoop knowledge is distributed computing.
What are the difference between these two concepts?
Thanks
Neo4j High Availability refers to an approach for scaling the number of requests to which Neo4j can respond. Neo4j HA implements a master slave with replication clustering model for high availability scaling. This means that all writes go to the "master" server (or are proxied to master from the slaves) and the update is synchronized among the slave servers. Reads can be sent to any server in the cluster, scaling out the number of requests to which the database can respond.
Compare this to distributed computing, which is a general term to describe how computation operations can be done in parallel across a large number of machines. One key difference is the concept of data sharding. With Neo4j each server in the cluster contains a full copy of the graph, whereas with a distributed filesystem such as HDFS, the data is sharded and each machine stores only a small piece of the entire dataset.
Part of the reason Neo4j does not shard the graph is that since a graph is a highly connected data structure, traversing through a distributed/sharded graph would involve lots of network latency as the traversal "hops" from machine to machine.

cas operation behavior under network partitioning

The Couchbase 2.0 manual describes network partitioning as a potential issue.
http://docs.couchbase.org/couchbase-manual-2.0/couchbase-architecture.html#couchbase-architecture-failover-automatic-considerations
But I didn't see how (if) Couchbase 2.0 deal with such issues on the datastore side.
My question is how is CAS implemented in a cluster and how does CAS operations deal with the split-brain problem? Is there a cluster wide lock? Is it last writer wins?
The same question was asked to our Google Groups list: http://groups.google.com/group/couchbase/browse_thread/thread/e0d543d9b17f9c77
It's down at the bottom of the thread, posts starting on Aug 30
Perry
Membase and Couchbase Server 2.0 are partitioning data. For each piece of data (vbucket) there's always single server that is source of truth.
Good side of this is that it's always strictly consistent. There's no need to design for conflict resolution etc.
But when some node goes down, you simply lose access to subset of your data. You can do failover in which case replicas will be promoted to masters for vbuckets that were lost, thus 'recoving' access to this vbuckets. Note that losing some recent mutations is unavoidable in that case due to some replication lag. And failover is manual operation (although recent version has very carefully implemented and limited autofailover).

Scaling with a cluster- best strategy

I am thinking about the best strategy to scale with a cluster of servers. I know there is no hard and fast rules, but I am curious what people think about these scenarios:
cluster of combination app/db servers that are round robin (with failover) balanced using dnsmadeeasy. the db's are synced using replication. Has the advantage that capacity can be augmented easily by adding another server to the cluster, and it is naturally failsafe.
cluster of app servers, again round robin load balanced (with failover) using dnsmadeeasy, all reporting to a big DB server in the back. easy to add app servers, but the single db server creates a single failure point. Could possible add a hot standby with replication.
cluster of app servers (as above) using two databases, one handling reads only, and one handling writes only.
Also, if you have additional ideas, please make suggestions. The data is mostly denormalized and non relational, and the DBs are 50/50 read-write.
Take 2 physical machines and make them Xen servers
A. Xen Base alpha
B. Xen Base beta
In each one do three virtual machines:
"web" server for statics(css,jpg,js...) + load balanced proxy for dynamic request (apache+mod-proxy-balancer,nginx+fair)
"app" server (mongrel,thin,passenger) for dynamic requests
"db" server (mySQL, PostgreSQL...)
Then your distribution of functions can be like this:
A1 owns your public ip and handle requests to A2 and B2
B1 pings A1 and takes over if ping fails
A2 and B2 take dynamic request querying A3 for data
A3 is your dedicated data server
B3 backups A3 second to second and offer readonly access to make copies, backups etc.
B3 pings A3 and become master if A3 becomes unreachable
Hope this can help you some way, or at least give you some ideas.
It really depends on your application.
I've spent a bit of time with various techniques for my company and what we've settled on (for now) is to run a reverse proxy/loadbalancer in front of a cluster of web servers that all point to a single master DB. Ideally, we'd like a solution where the DB is setup in a master/slave config and we can promote the slave to master if there are any issues.
So option 2, but with a slave DB. Also for high availability, two reverse proxies that are DNS round robin would be good. I recommend using a load balancer that has a "fair" algorithm instead of simple round robin; you will get better throughput.
There are even solutions to load balance your DB but those can get somewhat complicated and I would avoid them until you need it.
Rightscale has some good documentation about this sort of stuff available here: http://wiki.rightscale.com/
They provide these types of services for the cloud hosting solutions.
Particularly useful I think are these two entries with the pictures to give you a nice visual representation.
The "simple" setup:
http://wiki.rightscale.com/1._Tutorials/02-AWS/02-Website_Edition/2._Deployment_Setup
The "advanced" setup:
http://wiki.rightscale.com/1._Tutorials/02-AWS/02-Website_Edition/How_do_I_set_up_Autoscaling%3f
I'm only going to comment on the database side:
With a normal RDBMS a 50/50 read/write load for the DB will make replication "expensive" in terms of overhead. For almost all cases having a simple failover solution is less costly than implementing a replicating active/active DB setup. Both in terms of administration/maintenance and licensing cost (if applicable).
Since your data is "mostly denormalized and non relational" you could take a look at HBase which is an OSS implementation of Google Bigtable, a column based key/value database system. HBase again is built on top of Hadoop which is an OSS implementation of Google GFS.
Which solution to go with depends on your expected capacity growth where Hadoop is meant to scale to potentially 1000s of nodes, but should run on a lot less as well.
I've managed active/active replicated DBs, single-write/many-read DBs and simple failover clusters. Going beyond a simple failover cluster opens up a new dimension of potential issues you'll never see in a failover setup.
If you are going for a traditional SQL RDBMS I would suggest a relatively "big iron" server with lots of memory and make it a failover cluster. If your write ratio shrinks you could go with a failover write cluster and a farm of read-only servers.
The answer lies in the details. Is your application CPU or I/O bound? Will you require terabytes of storage or only a few GB?

Resources