Optimal Hosting for a Statistical Analysis Rails App

I am developing a web app that performs regression analysis on user data.
On the backend, RoR handles the application logic, and all statistical analysis is done by R (since Ruby has poor statistics packages).
Given that both R and RoR are single-threaded, and that the app is expected to be used concurrently by several users, I need your advice on the optimal configuration.
For example: should I run R and RoR on separate instances and have RoR communicate with R via REST? Run both on the same machine, which could then be clustered? Use Revolution Analytics?
What would be a good hosting configuration to allow my app to scale?

You could create a proxy that communicates with multiple webservers, and in turn have each of those webservers communicate, via another proxy, with several R_servers. To have the proxy servers balance the load, you can look into something like Nginx's upstream directive.
The diagram below shows 3 webservers (exact clones of each other) and 3 R_servers (also exact clones of each other). Use however many you need; it's easy to add or remove webservers or R_servers to scale horizontally.
          webserver1               R_server1
       /              \         /
proxy --- webserver2 --- proxy --- R_server2
       \              /         \
          webserver3               R_server3

Look at Rserve, which, when hosted on Linux, forks off a new instance for every connection.
The connection is made over the network, and Ruby client libraries are available.
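For illustration, here is a minimal sketch of the connect-and-eval pattern against Rserve, shown with the Python pyRserve client (the Ruby clients follow the same model); the host, port, and toy regression formula are assumptions, not part of your setup.

    import pyRserve

    # Rserve's default port is 6311; host and port here are assumptions.
    conn = pyRserve.connect(host='localhost', port=6311)
    try:
        # On Linux, each connection gets its own forked R process, so several
        # app-server workers can hold independent connections concurrently.
        coefs = conn.eval(
            'coef(lm(y ~ x, data = data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))))'
        )
        print(coefs)  # intercept and slope of the toy regression
    finally:
        conn.close()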

Related

Question regarding Monolithic vs. Microservice Architecture

I'm currently rethinking an architecture I was planning.
So suppose I have a system where there are about 8 different services interacting with a single database. Some services listen and react to database events and do stuff like sending SMS.
Then there's an API layer sitting on top of the database and a frontend connected to this API. So in my understanding this is rather monolithic.
In fact, I don't see any advantage to using containers in this scenario. Their real advantage is that they can be swapped out, right? My intuition tells me there is often no purpose in doing that, except maybe some load balancing at the API level. Instead, many companies just seem to blindly jump on the hype train of containerizing everything.
Now the question arises: is Docker the right tool for this context? In every forum, people advise against using Docker solely as a more resource-efficient "VM" that aggregates all services within a single container. However, that is the only real scenario in which I'd see any advantage in using Docker (the environment, e.g. Alpine Linux, is the same on all customers' computers when rolling out the system).
Even docker-compose does not "group" containers together into one complete system exposing only port 443; instead it starts an infrastructure of multiple interacting containers. Oftentimes, services like Kubernetes are then used to deploy these infrastructures onto "nodes", i.e. VMs.
However, in my opinion it would be great to have a single self-contained container, without putting everything into a VM. This container would include every necessary service and expose only one port, e.g. 443.
Since I'm rather confused now, I'd really appreciate your help here.
Thanks in advance!
Kubernetes does many things and has many useful features. But Kubernetes also requires that you architect your apps to follow The Twelve-Factor App principles. An important point here is that your apps must be stateless.
When the app is stateless, it is easy to scale out horizontally - this can also be done automatically when the load increases.
When the app is stateless, it is easy to do Rolling Deployments that upgrade the app to a new version without downtime.
You can run containers on bare-metal Linux servers, but that mostly makes sense for very big servers. If you use a cloud, you probably want multiple VM instances, distributed across three Availability Zones for increased availability.
"Self-contained container - exposing one port": with Kubernetes, you typically use a private network and only expose services via a single load balancer - typically on one port, with different URLs routing traffic to different services.
Some services listen and react to database events and do stuff like sending SMS.
As I said, many things are easier when an app is horizontally scalable, but this kind of app - one that listens for events and reacts to them - is one of the few cases where you cannot scale horizontally. It is a good fit for a serverless architecture instead, possibly on Kubernetes using Knative.
Now the question arises, is docker the right tool for this context?
My opinion is that most workloads will run in containers. It is more a question of how they should be run in Kubernetes - with one or multiple replicas, as stateless Deployments, as stateful StatefulSets, or some other way.
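As a rough illustration of that "scale out a stateless Deployment" point, here is a minimal sketch using the official Kubernetes Python client; the Deployment name "api", the "default" namespace, and the kubeconfig location are assumptions.

    from kubernetes import client, config

    # Assumes a kubeconfig at the default location (~/.kube/config) and an
    # existing stateless Deployment named "api" in the "default" namespace.
    config.load_kube_config()
    apps = client.AppsV1Api()

    # Scale out horizontally by raising the replica count; because the app is
    # stateless, Kubernetes can add Pods (or roll out new versions) without downtime.
    apps.patch_namespaced_deployment_scale(
        name='api',
        namespace='default',
        body={'spec': {'replicas': 3}},
    )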

FoundationDB, the layer: Is it hosted on client application or server nodes?

Recently I was reading about the concept of layers in FoundationDB. I like the idea: the decoupling of storage on one side from access to it on the other.
There are some points about the implementation of layers that are unclear to me, especially how they communicate with the storage engine. There are two possible answers: they are part of the server nodes and communicate with the storage through fast native API calls (e.g. as linked modules hosted in the server process), or they are hosted inside the client application and communicate through a network protocol. For example, the SQL layer of many RDBMSs is hosted on the server. How are things with FoundationDB?
PS: These two cases differ from a performance point of view, especially when the client-server communication is high-latency.
To expand on what Eonil said: the answer rests on the distinction between two different senses of "client" and "server".
Layers are not run within the database server processes. They use the FDB client API to make requests of the database, and do not (with one exception*) get to pierce the transactional key-value abstraction.
However, there is nothing stopping you from running the layers on the same physical (or virtual) machines as the database server processes. And, as that post from the community site mentions, there are use cases where you might very much wish to do this in order to minimize latencies.
*The exception is the Locality API, which is mostly useful in exactly those cases where you want to co-locate client-side layers with the data on which they operate.
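As a concrete (if toy) illustration of what a layer looks like in practice, here is a sketch of a tiny counter layer written purely against the client API, using the Python bindings; the key name and API version are assumptions.

    import fdb

    fdb.api_version(620)   # pick the API version your cluster supports
    db = fdb.open()        # uses the cluster file; requests go over the network

    # The "layer" is nothing more than client-side code built on transactions.
    @fdb.transactional
    def increment(tr, key):
        current = tr[key]
        value = (int(current) if current.present() else 0) + 1
        tr[key] = str(value).encode()
        return value

    print(increment(db, b'counters/page_views'))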
Layers are built on top of the client-side library.
Cited from http://community.foundationdb.com/questions/153/what-layers-do-you-want-to-see-first
That's a good question. One reason that it doesn't always make sense to run layers on the server is that in a distributed database, the data is scattered - the servers themselves are a network hop away from a random piece of data, just like the client.
Of course, for something like an analytics layer which is aware of what data each server contains, it makes sense to run a distributed version co-located with each of the machines in the FDB cluster.

openstack overkill for HA website stack?

Some background:
I'm building a pretty involved website (as far as the stack used is concerned). Components, among some other smaller stuff, include:
Elasticsearch
Redis
ZeroMQ
Couchbase
RethinkDB
traffic through Nginx -> Node
The intention is to have a highly available website that is also pretty lean (and low-cost).
Current topology I'm considering:
2 webservers in an active/active config with DNS load balancing (Nginx, static asset serving, etc.), plus load balancing to the second tier:
2 appservers in active/active. Most of the components, like Elasticsearch, can do sharding/replication themselves, so this should not be too hard to set up (fingers crossed)
session handling in replicated Redis
Naturally I want monitoring and alerting when something is wrong, and ideally the system should be able to handle failures automatically - stuff like promoting Redis from slave to master, or even initializing a new EC2 instance (if I were on EC2, that is).
However, I want to be free from a particular hosting provider. Which I believe (please correct if wrong) is where Openstack comes in.
Is it correct that:
- OpenStack allows me to control the entire lifecycle of my website stack (covering multiple boxes / virtual machines)?
- Does it allow me (with configuration work, of course) to spin up instances, monitor, alert when something goes wrong, take appropriate actions in those scenarios, etc.?
Or is OpenStack just entirely the wrong tool for the job? Is there anything else that would fit better as a sort of "management layer" on top of my entire website?
Thanks
OpenStack isn't VMware ESX. It's not a very good straight-up, simple virtual machine hosting environment. If what you want is a way to easily manage virtual machines, I might suggest Ganeti. It even has HA failover of virtual machines. In a two-physical-host environment, this is probably the way to go.
What OpenStack gives you that Ganeti won't is RESTful APIs. It has AWS-compatible APIs, and it has OpenStack APIs that are even better. If you want to automate elasticity or self-healing, this is huge. Being able to link up in Python using the existing client APIs and just write scripts that spin up instances as needed is what DevOps is all about.
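For example (a sketch, not a prescription), with the current openstacksdk a script that boots an instance is only a few lines; the cloud name, image, flavor, and network below are assumptions about your environment.

    import openstack

    # Credentials come from clouds.yaml or OS_* environment variables.
    conn = openstack.connect(cloud='my-cloud')

    server = conn.create_server(
        'app-01',
        image='ubuntu-22.04',
        flavor='m1.small',
        network='private',
        wait=True,
        auto_ip=True,
    )
    print(server.status)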
So I guess it comes down to your level of commitment and what you need. For 2 physical machines, OpenStack probably isn't the best solution. But down the line, when you've got more apps and VMs than you can manage manually, OpenStack will be there to help you write code that makes your datacenter dance to your tune.

Can Couchbase 2.0 be accessed from Erlang via erlmc & the memcached 1.3 protocol?

I'm developing an application in erlang/elixir. I'd like to access Couchbase 2.0 from erlang. I found the erlmc project (https://github.com/JacobVorreuter/erlmc ) which is a binary protocol memcached client. The notes say "you must have a version 1.3 or greater of memcached."
I understand that Couchbase 2.0 uses memcached binary protocol for accessing data, and I'm looking for the best way to do this from erlang.
The manual talks about a "Couchbase API Port" on 8092, and refers to port 11210 (close to the normal memcached port, 11211) as the "internal cluster port".
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-network-ports.html
So, the question is this:
Is setting up erlmc to talk to Couchbase 2.0 on port 8092 the correct way to go about it?
erlmc's documentation talks about how it hashes keys to find the right server, which makes me think it might target too old a version of the memcached protocol (or is there a built-in Moxi on Couchbase 2.0 that I should be connecting to? If so, which port?)
Which port do the views use? And presumably the REST interface for views does not support straight key lookups, so I'll need to write code to access that as well, right?
I'm keen to use a pure erlang solution since NIFs are not concurrent and I'll have some unknown number of processes wanting to access Couchbase 2.0 at the same time.
The last time I worked with Couch was CouchDB, and so I'm trying to piece things together after the merger of Couch and Membase.
If I'm off on the wrong track, please advise on the best way to access Couchbase 2.0 from Erlang in a highly concurrent manner. The memcached protocol should be pretty stable, so libraries a couple of years old should still work, right?
Thanks!
The short answer is: yes, Couchbase is compatible with the memcached text protocol.
But the key point here is "memcached text protocol". Since memcached uses two different protocol types (text and binary), you should use clients that speak the text protocol.
At Mochi, we use merle for memcached, and it looks like it should work for you. Recently, one of my colleagues forked it and made some minor corrections: https://github.com/twonds/merle
Also, consider taking a look at https://github.com/EchoTeam/mcd. This client could use some refactoring, but is also production proven and even allows simple sharding.
Thanks to Xavier's contributions, I refactored the whole thing and added pooling; it now builds and performs okay. I also included a basho_bench driver so you can benchmark it yourself. You can find the code here. I am pretty sure this performs better than the text protocol.
I had to create my own vbucket-aware Erlang Couchbase client based on erlmc.
The differences:
- an HTTP connection to retrieve the vbucket map from Couchbase
- the two "reserved" bytes are filled with the vbucket id (see the Python client for an example)
- an active-once asynchronous TCP connection, for performance reasons
The only answer I have so far is:
https://github.com/chitika/cberl
This project is based on the C++ "official" couchbase client.
It seems to have two possible problems:
1) it might be abandoned (last activity was 3 months ago)
2) it uses a NIF, which, as I understand it, cannot be accessed concurrently.
We don't use Couchbase with Erlang, but with Python, which also needs to connect with a memcache client. I can't speak to the Erlang libraries specifically, but hopefully the lessons apply in both situations.
Memcache Client Limitations
Memcache clients can only access memcache functionality. You won't be able to use views or any other features not specified in the memcache protocol. If you want access to the views, you will need to use the REST protocol separately on port 8092 (docs).
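For instance, a view query is just an HTTP request to port 8092; in the sketch below the bucket, design document, and view names are made up for illustration.

    import requests

    # Bucket "default", design doc "users", view "by_email" are assumptions.
    resp = requests.get(
        'http://couchbase-node1:8092/default/_design/users/_view/by_email',
        params={'key': '"alice@example.com"', 'stale': 'false'},
    )
    for row in resp.json().get('rows', []):
        print(row['id'], row['value'])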
Connecting to Couchbase with Vanilla Memcache Clients
The ports mentioned on that page are used either internally or by "smart" clients written for Couchbase specifically. By default, memcache clients can connect to the normal memcache port 11211 on any of the nodes in your Couchbase cluster. Do not use the memcache cluster features of any memcache client not written specifically for Couchbase; the usual methods of distribution for vanilla memcached are incompatible with Couchbase.
Explanation
In order to connect with a memcached client, you need to connect directly to the port for the Couchbase bucket. When you set up a new bucket, you specify the port you want the bucket to be accessible on. The default bucket is set up on port 11211. Each bucket acts like an independent memcached instance, but is internally distributed across all nodes in the cluster. You can connect to the bucket port on any of the Couchbase servers, and you will be accessing the same data set.
This means that you should not try to use the distributed memcache features of your memcache client. Those features are designed for ad-hoc memcached clusters. Just connect to the appropriate port on the Couchbase server as if it was a single memcached server.
The reason this is possible is because there is a Moxi instance which finds the appropriate Couchbase server to process the request. This Moxi instance automatically runs for each bucket on every Couchbase server. Even though you may not be connected to the node which has your specific key, Moxi will transparently direct your request to the appropriate server.
In this way, you can use a vanilla Memcache client to talk to Couchbase, without needing any additional logic to keep track of cluster topology. Moxi takes care of that piece for you.
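To make that concrete, here is roughly what it looks like from Python with the plain python-memcached client (the node hostname is an assumption; any node in the cluster works):

    import memcache

    # Connect to the bucket's port (11211 for the default bucket) on any one node;
    # Moxi on that node forwards each request to the server that owns the key.
    mc = memcache.Client(['couchbase-node1:11211'])

    mc.set('user:42:name', 'alice')
    print(mc.get('user:42:name'))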
Binary protocol
We did have the binary protocol working at one point, but there were problems when we tried to use the flush_all command. That was a while ago, though. I suggest experimenting yourself to see if the level of support meets your needs.

Best practice for rate limiting users of a REST API?

I am putting together a REST API, and as I'm unsure how it will scale or what the demand for it will be, I'd like to be able to rate limit use of it, as well as to be able to temporarily refuse requests when the box is over capacity or in some kind of slashdotted scenario.
I'd also like to be able to gracefully bring the service down temporarily (while giving clients results that indicate the main service is offline for a bit) when/if I need to scale the service by adding more capacity.
Are there any best practices for this kind of thing? Implementation is Rails with mysql.
This is all done with an outer webserver that listens to the world (I recommend nginx or lighttpd).
Regarding rate limits: nginx is able to limit requests, e.g. to 50 requests/minute per IP; anything over that gets a 503 page, which you can customize.
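If it helps to see the logic spelled out, here is a rough sketch (pure Python, purely illustrative - not nginx's implementation) of the fixed-window, per-IP counting that such a limiter applies, using the same 50 requests/minute figure:

    import time
    from collections import defaultdict

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 50                            # 50 requests per minute per IP

    _counters = defaultdict(lambda: (0, 0.0))    # ip -> (count, window_start)

    def allow(ip):
        now = time.time()
        count, start = _counters[ip]
        if now - start >= WINDOW_SECONDS:        # new window: reset the counter
            count, start = 0, now
        if count >= MAX_REQUESTS:
            return False                         # answer this request with a 503
        _counters[ip] = (count + 1, start)
        return True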
Regarding expected temporary downtime: in the Rails world this is done via a special maintenance.html page. There is usually some automation that creates or symlinks that file when the Rails app servers go down. I'd recommend relying not on the file's presence, but on the actual availability of the app server.
But really, you are able to start/stop services without losing any connections at all. That is, you can run a separate instance of the app server on a different UNIX socket/IP port and have the balancer (nginx/lighty/haproxy) use that new instance too. Then you shut down the old instance and all clients are served by the new one only. No connections are lost. Of course this scenario is not always possible; it depends on the type of change you introduced in the new version.
haproxy is a balancer-only solution. It can balance requests to the app servers in your farm extremely efficiently.
For a pretty big service you end up with something like:
api.domain resolving to N round-robin balancers
each balancer proxying requests to M webservers for static content and P app servers for dynamic content. Oh well, your REST API doesn't have static files, does it?
For a pretty small service (under 2K rps) all balancing is done inside one or two webservers.
Good answers already - if you don't want to implement the limiter yourself, there are also solutions like 3scale (http://www.3scale.net) which does rate limiting, analytics etc. for APIs. It works using a plugin (see here for the ruby api plugin) which hooks into the 3scale architecture. You can also use it via varnish and have varnish act as a rate limiting proxy.
I'd recommend implementing the rate limits outside of your application since otherwise the high traffic will still have the effect of killing your app. One good solution is to implement it as part of your apache proxy, with something like mod_evasive
