CAS operation behavior under network partitioning - Membase

The Couchbase 2.0 manual describes network partitioning as a potential issue.
http://docs.couchbase.org/couchbase-manual-2.0/couchbase-architecture.html#couchbase-architecture-failover-automatic-considerations
But I didn't see how (or whether) Couchbase 2.0 deals with such issues on the datastore side.
My question is: how is CAS implemented in a cluster, and how do CAS operations deal with the split-brain problem? Is there a cluster-wide lock? Is it last-writer-wins?

The same question was asked on our Google Groups list: http://groups.google.com/group/couchbase/browse_thread/thread/e0d543d9b17f9c77
It's near the bottom of the thread, in posts starting on Aug 30.
Perry

Membase and Couchbase Server 2.0 partition data. For each piece of data (vbucket) there is always a single server that is the source of truth.
The good side of this is that it's always strictly consistent. There's no need to design for conflict resolution, etc.
But when some node goes down, you simply lose access to a subset of your data. You can do a failover, in which case replicas will be promoted to masters for the vbuckets that were lost, thus recovering access to those vbuckets. Note that losing some recent mutations is unavoidable in that case due to replication lag. Failover is a manual operation (although recent versions have a very carefully implemented and limited autofailover).
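To make that concrete: because each vbucket has exactly one master, CAS is just an ordinary compare-and-swap against the item's CAS value on that single server; there is no need for a cluster-wide lock and no last-writer-wins merging. A minimal sketch of the usual optimistic retry loop, assuming a memcached-style client with gets/cas support (python-memcached with cache_cas enabled is one such client):

```python
import memcache

# Connect through moxi, which routes each key to its vbucket master.
# cache_cas=True makes the client remember CAS tokens from gets().
mc = memcache.Client(['127.0.0.1:11211'], cache_cas=True)

def increment_counter(key, retries=10):
    """Optimistic read-modify-write: no lock, just retry on CAS mismatch."""
    for _ in range(retries):
        value = mc.gets(key)           # fetch value and its CAS token
        if value is None:
            if mc.add(key, 1):         # add() fails if someone created it first
                return 1
            continue
        if mc.cas(key, value + 1):     # succeeds only if the token is current
            return value + 1
        # another writer changed the item since our gets(); try again
    raise RuntimeError('too much contention on %r' % key)
```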

Related

How to do Neo4j Cache-based Sharding?

I've been reading Neo4j's Operational Manual on Cache Sharding, and posts all over the web, but I can hardly find any detailed example of how to configure HAProxy for cache sharding (yes, the one in the Operational Manual is rather brief) on a real-world graph, which may contain multiple node labels.
Has anyone ever done this before? It would be lovely if you could share your experience.
Moreover, I'm a bit confused about the mechanism for sharding the graph with HAProxy. How do sub-graphs get cached on certain slaves, merely by providing rules in HAProxy? It surprised me to learn that cache sharding isn't handled by Neo4j itself.
The goal is to always send queries hitting the same region of your graph to the same instance. This of course means the request data must indicate the region. What to use as the "region indicator" depends heavily on the structure and shape of your graph.
In many customer-facing applications, people have successfully used the current user id, setting it as an additional HTTP header which HAProxy then evaluates.
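For illustration, a minimal HAProxy sketch of that approach (the X-User-Id header name and the addresses are made up). balance hdr(...) hashes the header value, so all requests for a given user land on the same Neo4j instance and that user's sub-graph stays hot in its cache:

```
frontend neo4j_http
    bind *:7474
    default_backend neo4j_cluster

backend neo4j_cluster
    # hash the X-User-Id header: same user -> same slave -> warm cache
    balance hdr(X-User-Id)
    server neo4j1 10.0.0.1:7474 check
    server neo4j2 10.0.0.2:7474 check
    server neo4j3 10.0.0.3:7474 check
```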

FoundationDB, the layer: Is it hosted on client application or server nodes?

Recently I was reading about the concept of layers in FoundationDB. I like the idea: decomposing storage on one side from access to it on the other.
There are some unclear points regarding the implementation of layers, especially how they communicate with the storage engine. There are two possible answers: they are part of the server nodes and communicate with the storage through fast native API calls (e.g. as linked modules hosted in the server process), or they are hosted inside the client application and communicate through a network protocol. For example, the SQL layer of many RDBMSs is hosted on the server. How do things stand with FoundationDB?
PS: These two cases differ from a performance point of view, especially when the client-server communication is high-latency.
To expand on what Eonil said: the answer rests on the distinction between two different senses of "client" and "server".
Layers are not run within the database server processes. They use the FDB client API to make requests of the database, and do not (with one exception*) get to pierce the transactional key-value abstraction.
However, there is nothing stopping you from running the layers on the same physical (or virtual) machines as the database server processes. And, as that post from the community site mentions, there are use cases where you might very much wish to do this in order to minimize latencies.
*The exception is the Locality API, which is mostly useful in exactly those cases where you want to co-locate client-side layers with the data on which they operate.
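As a concrete illustration of layers being plain client code, here is a toy counter "layer" written against the FoundationDB Python binding. Everything below runs in the client process; only transactional key-value requests cross the network (the key name and API version are arbitrary):

```python
import fdb

fdb.api_version(200)   # use the version matching your installed client
db = fdb.open()        # connects via the client library; no code runs server-side

@fdb.transactional
def increment(tr, key):
    # Read-modify-write inside one transaction; the binding retries the
    # whole function automatically if a conflicting commit is detected.
    current = tr[key]
    n = int(current.decode()) if current.present() else 0
    tr[key] = str(n + 1).encode()
    return n + 1

print(increment(db, b'my-counter'))
```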
Layers are built on top of the client-side library.
Quoted from http://community.foundationdb.com/questions/153/what-layers-do-you-want-to-see-first:
That's a good question. One reason that it doesn't always make sense to run layers on the server is that in a distributed database, that data is scattered: the servers themselves are a network hop away from a random piece of data, just like the client.
Of course, for something like an analytics layer which is aware of what data each server contains, it makes sense to run a distributed version co-located with each of the machines in the FDB cluster.

Scaling HBase writes on a cluster using Thrift

We're trying to scale up HBase writes on a cluster using Thrift. (Our HBase application is in Python, and hence needs Thrift.)
Despite increasing the number of nodes in the cluster, we are seeing the same write speeds.
First off, is the recommended strategy to run Thrift on:
1. The client?
2. The HBase master?
3. HBase region servers?
If on #1 or #2, will the client or HBase master take care of splitting the requests to the various region servers? It doesn't appear to in our case.
If #3, then I have to modify the client to write to the specific region servers, and randomize the writes. I can do this, but it seems to defeat the purpose of using HBase.
Any other tips on read/write scaling (especially with Thrift) are greatly appreciated.
In HBase, to gain performance as you add nodes you need a decent rowkey distribution. As long as you have hot spots (one very busy region server) in your cluster, you will not gain anything from increasing the cluster size. Check out the article on rowkey design to start with.
If you don't need to read your writes back right away (i.e. you are comfortable with async writes), you can look at the asynchbase client from StumbleUpon for a performance gain. Both ideas are sketched below.
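A sketch of both ideas in Python; the salting scheme and the happybase library are illustrative choices (any Thrift-based client works). A hashed key prefix spreads sequential keys across region servers, and batching amortizes Thrift round-trips:

```python
import hashlib
import happybase

NUM_BUCKETS = 16  # roughly match the number of region servers

def salted_key(rowkey):
    # Prefix with a stable hash bucket so sequential keys (timestamps,
    # incrementing ids) don't all land on one hot region. Note that
    # scans and reads must then fan out across all bucket prefixes.
    bucket = int(hashlib.md5(rowkey).hexdigest(), 16) % NUM_BUCKETS
    return b'%02d-%s' % (bucket, rowkey)

connection = happybase.Connection('thrift-host', port=9090)  # hypothetical host
table = connection.table('events')                           # hypothetical table

# batch() buffers mutations client-side and flushes them in bulk
with table.batch(batch_size=1000) as batch:
    for i in range(10000):
        batch.put(salted_key(b'event-%08d' % i), {b'cf:value': b'...'})
```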
I found the answer in these two questions; it looks like we'll go with #3 (write to the specific region servers, and randomize the writes):
Is it better to send data to hbase via one stream or via several servers concurrently?
HBase Thrift: how to connect to remote HBase master/cluster?

Horizontal scaling of JSF 2.0 application

Given that JavaServer Faces is inherently stateful on the server side, what methods are recommended for horizontally scaling a JSF 2.0 application?
If an application runs multiple JSF servers, I can imagine the following scenarios:
Sticky Sessions: send all requests matching a given session to the same server.
Question: what technology is commonly used to achieve this?
Problem: server failure results in lost sessions... and it generally seems like a fragile architecture, especially when starting fresh (not trying to scale an existing application)
State (Session) Replication: replicate JSF state on all JSF servers in cluster
Question: what technology is commonly used to achieve this?
Problem: does not scale. With full replication every server holds every session, so the usable session memory of the whole cluster equals the memory of the smallest server.
Instruct JSF (via configuration) to store its state on an external resource (e.g. another server running a very fast in-memory database), then access that resource from the JSF servers when application state is needed?
Question: is this possible?
Instruct JSF (via configuration) to be stateless?
Question: is this possible?
[EDIT]
Updated in response to Ravi's suggestion of Sticky Sessions
This can be achieved by configuring your load balancer in sticky session mode.
This way, all subsequent requests for a given session are sent to the same application server.
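HAProxy is one common way to do this; a minimal sketch (server names and addresses are placeholders) that inserts a cookie naming the chosen backend, so every later request carrying that cookie is pinned to the same server:

```
backend jsf_servers
    balance roundrobin
    # on the first response, insert a cookie identifying the server;
    # subsequent requests carrying that cookie stick to it
    cookie SRV insert indirect nocache
    server app1 10.0.0.1:8080 check cookie app1
    server app2 10.0.0.2:8080 check cookie app2
```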
Here's an idea from Jelastic PaaS:
Pair up application instances in 2-server clusters, and apply session replication between those 2 instances within one cluster. Then you can add as many 2-instance clusters as you want and load balance requests between the clusters, with each session sticking to the cluster it originated from. Within a cluster, requests can be load balanced between the instances.
This way there is some degree of fail-safety: if one instance in a cluster fails, the other takes over with the same session state. On the other hand, the memory impact is not as severe as with full replication.
In short, it is a combination of 1. and 2. from your question. Of course, there can be more than 2 instances in each cluster if availability is of greater concern.
Link to Jelastic docs I lifted the idea from: http://jelastic.com/docs/session-replication.
Disclaimer: I don't actually know how to configure this with JSF2, and have no affiliation with Jelastic. Just liked the idea and thought it might help.
What about session replication with "buddy" semantics?
With one buddy, each server only has to hold the session data of two servers (its own and its buddy's), which is a lot better than having to hold the data of each and every server in the cluster.
Buddy replication also reduces bandwidth overhead.

Why is the Membase server so slow in response time?

I have a problem: Membase is very slow in my environment.
I am running several production (Passenger) servers on Rails 2.3.10, Ruby 1.8.7.
Those servers communicate with 2 Membase machines in a cluster.
The Membase machines each have 64GB of memory, a 100GB EBS volume attached, and 1GB of swap.
My problem is that Membase is VERY slow in response time and is actually the slowest part of the whole application lifecycle right now.
My question is: why?
The Rails gem I am using is memcache-northscale.
The Membase server is 1.7.1 (latest).
The cluster is doing between 2K and 7K ops per second.
The response time from Membase (based on NewRelic) is 250ms on average, which is HUGE and unreasonable.
Does anybody know why this is happening?
What can I do in order to improve this time?
It's hard to say immediately with the data at hand, but there are a few things you may wish to dig into to narrow down where the issue lies.
First of all, do your Membase stats show a significant number of background fetches? This appears in the Web UI statistics as "disk reads per second". If so, that's the likely culprit for the higher latencies.
You can read more about the statistics and sizing in the manual, particularly the sections on statistics and cluster design considerations.
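If you prefer pulling those numbers programmatically rather than from the Web UI, the plain memcached stats command also works against Membase. A sketch with python-memcached (treating ep_bg_fetched as the ep-engine's background-fetch counter is my assumption):

```python
import memcache

mc = memcache.Client(['membase-node:11211'])  # hypothetical host
for server, stats in mc.get_stats():
    # ep_bg_fetched: items the engine had to fetch from disk
    print(server, 'background fetches:', stats.get('ep_bg_fetched'))
```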
Second, you're reporting 250ms on average. Is this a sliding average, or overall? Do you have 90th or 99th percentile latencies? A few outlying disk fetches can produce a large average even when most requests (for example, those served from RAM with no disk fetch) are actually quite speedy.
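To see how badly a few disk fetches can skew a mean, here is a toy computation with made-up numbers; 5% slow disk reads drag the average far above what 90% of requests actually experience:

```python
import statistics

# hypothetical latency samples (ms): mostly RAM hits, a few disk fetches
samples = [2.0] * 85 + [15.0] * 10 + [600.0] * 5

samples.sort()
mean = statistics.mean(samples)
p90 = samples[int(len(samples) * 0.90) - 1]
p99 = samples[int(len(samples) * 0.99) - 1]
print(f"mean={mean:.1f}ms  p90={p90:.1f}ms  p99={p99:.1f}ms")
# -> mean=33.2ms  p90=15.0ms  p99=600.0ms
```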
Are your systems spread across availability zones? What kind of instances are you using? Are the clients and servers in the same AWS region? I suspect the answer to the first question is "yes", which recent measurements show adds about 1.5ms of overhead on xlarge instances. This can matter if you're doing a lot of fetches synchronously and in serial within a given method.
I expect it's all in one region, but it's worth double checking since those latencies sound like WAN latencies.
Finally, there is an updated Ruby gem, backwards compatible with Fauna, which Couchbase, Inc. has been working to contribute back to Fauna upstream. If possible, you may want to try the gem referenced here:
http://www.couchbase.org/code/couchbase/ruby/2.0.0
You will also want to look at running moxi on the client side. To access Membase you need to go through a proxy (called moxi). By default it's installed on the servers, which means a request may land on a server that doesn't actually have the key; moxi will go get it, but then you're doubling the network traffic.
Installing moxi on the client side eliminates this extra network traffic: http://www.couchbase.org/wiki/display/membase/Moxi
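With moxi running locally, the application just talks to localhost and the local proxy handles the vbucket routing. A sketch in Python (the same applies to the Ruby client; 11211 is the default memcached port):

```python
import memcache

# client-side moxi forwards each key straight to the node that owns
# its vbucket -- one network hop instead of two
mc = memcache.Client(['127.0.0.1:11211'])
mc.set('session:42', 'payload')
print(mc.get('session:42'))
```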
Perry
