What it the real benefit from Erlang's fault tolerance for a web project? - erlang

Let's assume we have a web project in which we want to have ~10000 web clients connected to the server simultaneously. Let's also assume that one client session lasts about 25 minutes.
If we compare LAMP stack or any other popular web stack/framework (Ruby on Rails with Apache on Linux, etc.) to a web project built in Erlang/OTP - what does Erlang/OTP have in terms of fault tolerance that other frameworks don't?
What event can happen to a client that will cause the whole LAMP stack crash, while Erlang/OTP will stand its ground?

Note that a typical LAMP-stack does employ some fault tolerance. In particular, if a request in the LAMP stack fail, only that request will, while the rest of the code will run on. This kind of protection allows you to have faults in a single request without that hurting other requests.
Erlang provides this idea of "able to cope with smaller unforseen errors" at a much finer grained scale. You may have other subsystems in an application, and the same kind of tolerance to errors can be extended to those. You won't get it "for free", but the tooling is there to build a system which is robust. Imagine a client error in the LAMP stack. This will often lead to a disconnect of that client. It may not be so in Erlang and the client can keep on running.
For a system of 10000 clients, Erlang provides the advantage that you can have a process per client. Or perhaps 10 processes per client. That is much harder to pull of in many languages since a process/thread is rather heavy and expensive. Note that interprocess communication between the clients are easy, also if some clients are on another machine (imagine extending to a distributed cluster some day).
If you write your code in a certain way, you can make sure that if a client crashes, for one reason or the other, then its state is properly cleaned up by other processes. That can avoid lots and lots of small nasty leaks in state as well.

what does Erlang/OTP have in terms of fault tolerance that other frameworks don't?
Now, Erlang/OTP has very minimal if not, Zero side effects. Because of its concurrency, a web server like yaws literally spawns a small web server for every connection. If one user is affected by a given web service fault in your application, all other users will never notice and the only that users process may exit.
With OTP, you can build applications with a number of supervisors, such that if a server goes down, its restarted and many other functions and options you may need.
Erlang's distribution allows us to write distributed applications. I have personally used yaws web server in building a web application.
My experience is that which ever web server you may pick, say Mochiweb, a tutorial found here: http://alexmarandon.com/articles/mochiweb_tutorial/, is quite impressive in performance. Infact, a few years back, the oldest version of yaws web Server was bench marked with the (then) newest version of Apache, and the results of the bench mark are very awakening. When i went through a million user comet application with Mochiweb, which has Part 2 and Part 3, i was impressed. These are some of the few examples of powerful web frameworks built for the web.
And by the way, have you heard of these new NO SQL Databases with REST (HTTP) interface developed in Erlang/OTP e.g. Membase Server [home page here: http://www.couchbase.com/products-and-services/membase-server], Couch DB, and Riak, their performance is very impressive, meaning that their web/REST interface is very stable and due to their impressive , documented write through put, they have proved that their underlying Technology (Erlang/OTP), was built not only for high availability and fault-tolerant systems only, but for the web too! Just read through this document: http://blog.couchbase.com/why-membase-uses-erlang
Many more web frameworks built in erlang and are very impressive can be found listed on the wiki page here: http://en.wikipedia.org/wiki/Erlang_(programming_language). A probably better summary of its features that make it powerful on the web can be found in here: http://cs.nyu.edu/~lerner/spring10/projects/Erlang.pdf

Related

Is Erlang bad language for this app?

I am building framework for realtime web applications. I started to do it in Elixir, because
it is modern way how to develop application for Erlang VM. Erlang should be good if you need concurrency, fault tolerant, scalable apps (something like web server etc.). That is exactly what i need.
Question: Realtime framework always need for instance keep information about who is interested in what. This will be accomplished by using publish/subscribe pattern. So i will have 1000 clients subscribing to topic "newest-message". I need to save those clients (pid of process representing each client) somewhere to later access them if content for topic "newest-message" appears.
This is where i am confused if Erlang is really good for my framework.
ETS is probably the only option where to store shared data, but ETS is always copying everything if you save/access records. So that means copy 1000 pids always when i need to access them (instead of just iterating over some list, if i will do it for instance in c/java/python).
This will be probably great bottleneck if still copying many and many records from ETS (many clients, many subscriptions etc), i am right?
Sharing the state may be a sign of bad design. You can for example have process for each queue/topic and it will store its own list of subscribers. You send a message to that topic process and it in turn sends the message to clients. This way, you don't copy entire subscriber list.
If you need to process them in parallel, you can split the subscriber list between more processes.
The fault tolerance of Erlang is achieved, because it doesn't let you share state and you have to put more thought to the design, that will not involve state sharing, but will be efficient. This will pay off in the long run, so Erlang/Elixir is definitely good language for this kind of apps. Just look at RabbitMQ.
In my opnion, if you plan to save states like "who is interested in what" Erlang alone may not be a good idea. Of course, sometimes it is very convenient to pass everything in signals (like you'd do in Erlang), but when there is much content to store - lack of state in Erlang starts to hinder you rather than help.
On the other hand, you can keep a broad piece of convenience of Erlang and use it with a Java application, for example. Erlangs interface for Java enables you to connect both technologies quite easily, and at the same time you can use a Java app to store information for you (and save them somewhere, when necessary) and Erlang for the whole concurrent signaling real time part. Even better than that: you can still implement OTP with architecture like that, so you can create quite a lightweight application (because real-time logic is done by Erlang for you) being able to access stored data easily (because Java helps you here).

Cloud computing: Learn to scale server up/down automatically

I'm really impressed with the power of cloud computing when it comes to the possibility to scale up and down your facilities depending on your load.
How can I shift my paradigm and learn to write my applications in that way? Write it once and forget(no matter of the future load) would be the best solution.
How can I practice my skills in that area?
Setup virtualization environment when I can add another VMs into the private cloud(via command line?) on some smart algorithms to foresee the load for some period of time?
Ideally I want to practice it without buying actual Cloud computing services and just on my hardware.
The only thing I want to practice here is app/web role and/or message queue systems scaling when current workers have too many jobs in queue. So let's rule out database scaling from the question's goal as too big topic.
One option I will throw out is to use a native Cloud execution framework. You might look at CloudIQ Platform. One component is CloudIQ Engine. It allows you to develop cloud native apps in C/C++, Java and .NET. You get the capabilities of scale up by simply adding workers to your cloud. The framework automatically distributes your applications to the new machine(s), and once installed, will begin sending work to them as requests come in. So in effect the cloud handles your queueing issue for you.
Check out the Download and Community links for more information.
You should try AWS- Amazon's offering a free tier that gives you storage, messaging and micro instances (only linux). you can start developing small try-outs without paying. writing an application that scales isn't that hard- try to break your flow into small, concurrent tasks. client-server applications are even easier- use a load balancer to raise\terminate servers by demand.

How to specifically identify server issues from a load test result (using LoadRunner)?

How do you isolate a performance issue to a specific component of the application infrastructure? Specifically, are there distinct markers in the result logs that distinguish between bottlenecks at web, application and/or database server levels?
I was asked this question in an interview and went blank on it. Seems this information is not available anywhere.
In addition to SiteScope and other agentless monitoring of system components, you need to make sure your scenario and scripts are working as expected. This includes proper error checking and use of transactions (and a host of other things). If the transactions are granular enough, this will give you insight into at least the requests that have performance issues. Once you have these indicators, work with the infrastructure team to review logs and other information. Being an iterative process, tests can be made to focus on a smaller and smaller section of the infrastructure.
In addition, loadrunner scripts don't have to be made strictly 'coming in through the frontdoor'. If you have a multi-tiered system, scripts can be made to hit the web/app/database servers directly.
For what to look for, focus on any measurements that have 'knees' or 'hockey stick' type of behaviour. You can hook into any of the server resource type measurements directly in the controller and integrate other team's stats in the analysis phase. Compare with benchmarks at lower virtual user levels to determine what is acceptable and unacceptable.
Good luck!
If the interview is focused on LoadRunner and SiteScope is considered - I'd come to conclusion that it's more focused on HP/Mercury solutions.. In that case I'd suggest you to look into HP Diagnostics and it's LoadRunner integration capabilities.
This type of information is usually not available by just looking at the standard results from a performance test.
Parts of the information you are looking for MAY be found by using SiteScope to monitor all the relevant servers in the test. SiteScope offers many counters to look at such as CPU, Memory, Disk I/O and Network I/O - as seen on each server.
This information perhaps gives clues as to where the bottleneck is, and the more counters you add to SiteScope, the bigger the change to pinpoint the bottleneck.
It is a very common misconception that AppServer and DBServer bottlenecks could be identified by just looking at the raw response times or hits, pages etc (web protocol), unless of course the URI accessed defines the exact component(s) in the system...

What is the most common approach for designing large scale server programs?

Ok I know this is pretty broad, but let me narrow it down a bit. I've done a little bit of client-server programming but nothing that would need to handle more than just a couple clients at a time. So I was wondering design-wise what the most mainstream approach to these servers is. And if people could reference either tutorials, books, or ebooks.
Haha ok. didn't really narrow it down. I guess what I'm looking for is a simple but literal example of how the server side program is setup.
The way I see it: client sends command: server receives command and puts into queue, server has either a single dedicated thread or a thread pool that constantly polls this queue, then sends the appropriate response back to the client. Is non-blocking I/O often used?
I suppose just tutorials, time and practice are really what I need.
*EDIT: Thanks for your responses! Here is a little more of what I'm trying to do I suppose.
This is mainly for the purpose of learning so I'd rather steer away from use of frameworks or libraries as much as I can. Take for example this somewhat made up idea:
There is a client program it does some function and constantly streams the output to a server(there can be many of these clients), the server then creates statistics and stores most of the data. And lets say there is an admin client that can log into the server and if any clients are streaming data to the server it in turn would stream that data to each of the admin clients connected.
This is how I envision the server program logic:
The server would have 3 Threads for managing incoming connections(one for each port listening on) then spawning a thread to manage each connection:
1)ClientConnection which would basically just receive output, which we'll just say is text
2)AdminConnection which would be for sending commands between server and admin client
3)AdminDataConnection which would basically be for streaming client output to the admin client
When data comes in from a client to the server the server parses what is relevant and puts that data in a queue lets say adminDataQueue. In turn there is a Thread that watches this queue and every 200ms(or whatever) would check the queue to see if there is data, if there is, then cycle through the AdminDataConnections and send it to each.
Now for the AdminConnection, this would be for any commands or direct requests of data. So you could request for statistics, the server-side would receive the command for statistics then send a command saying incoming statistics, then immediately after that send a statistics object or data.
As for the AdminDataConnection, it is just the output from the clients with maybe a few simple commands intertwined.
Aside from the bandwidth concerns of the logical problem of all the client data being funneled together to each of the admin clients. What sort of problems would arise from this design due to scaling issues(again neglecting bandwidth between clients and server; and admin clients and server.
There are a couple of basic approaches to doing this.
Worker threads or processes. Apache does this in most of its multiprocessing modes. In some versions of this, a thread or process is spawned for each request when the request arrives; in other versions, there's a pool of waiting threads which are assigned work as it arrives (avoiding the fork/thread create overhead when the request arrives).
Asynchronous (non-blocking) I/O and an event loop. This is basically using the UNIX select call (although both FreeBSD and Linux provide more optimized alternatives such as kqueue). lighttpd uses this approach and is able to achieve very high scalability, but any in-server computation blocks all other requests. Concurrent dynamic request handling is passed on to separate processes (via CGI) or waiting processes (via FastCGI or its equivalent).
I don't have any particular references handy to point you to, but if you look at the web sites for open source projects using the different approaches for information on their design wouldn't be a bad start.
In my experience, building a worker thread/process setup is easier when working from the ground up. If you have a good asynchronous framework that integrates fully with your other communications tasks (such as database queries), however, it can be very powerful and frees you from some (but not all) thread locking concerns. If you're working in Python, Twisted is one such framework. I've also been using Lwt for OCaml lately with good success.

What are the requirements for an application health monitoring system?

What, at a minimum, should an application health-monitoring system do for you (the developer) and/or your boss (the IT Manager) and/or the operations (on-call) staff?
What else should it do above the minimum requirements?
Is monitoring the 'infrastructure' applications (ms-exchange, apache, etc.) sufficient or do individual user applications, web sites, and databases also need to be monitored?
if the latter, what do you need to know about them?
ADDENDUM: thanks for the input, i was really looking for application-level monitoring not infrastructure monitoring, but it is good to know about both
Whether the application is running.
Unusual cpu/memory/network usage.
Report any unhandled exceptions.
Status of various modules (if applicable).
Status of external components (databases, webservices, fileservers, etc.)
Number of pending background tasks (if applicable).
Maybe track usage of the application and report statistics on most/less used functionalities so you know where optimizations are most beneficial.
The answer is 'it depends'. Why do you need to monitor? How large is your operations staff? Do you need reporting? What is the application environment? Who cares if the application fails? Who cares if an exception happens? Are any of the errors recoverable? I could ask questions like these for a long time.
Great question.
We've been looking for some application-level monitoring solution for our needs some time ago without any luck. Popular monitoring solution are mostly addressed to monitor infrastrcture and - in my opinion - they are too complicated for a requirements of most of small and mid-sized companies.
We required (mainly) following features:
alerts - we wanted to know about
incident as fast as possible
painless management - hosted service wouldbe
the best
visualizations - it's good to know what is going on and take some knowledge from the data
Because we didn't find suitable solution we started to write our own. Finally we've ended with up-and-running service called AlertGrid. (You can check it for free of course.)
The idea behind it is to provide an easy way to handle custom monitoring scenarios. Integration API is very simple (one function with two required parameters). At the momment we and others are using it for:
monitor scheduled tasks (cron jobs)
monitor entire application logic execution
alert on errors in applications
we are also working on examples of basic infrastructure monitoring using AlertGrid
This is such an open ended question, but I would start with physical measurements.
1. Are all the machines I think are hosting this site pingable?
2. Are all the machines which should be serving content actually serving some content? (Ideally this would be hit from an external network.)
3. Is each expected service on each machine running?
3a. Have those services run recently?
4. Does each machine have hard drive space left? (Don't forget the db)
5. Have these machines been backed up? When was the last time?
Once one lays out the physical monitoring of the systems, one can address those specific to a system?
1. Can an automated script log in? How long did it take?
2. How many users are live? Have there been a million fake accounts added?
...
These sorts of questions get more nebulous, and can be very system specific. They also usually can be derived reactively when responding to phsyical measurements. Hard drive fill up, maybe the web server logs got filled up because a bunch of agents created too many fake users. That kind of thing.
While plan A shouldn't necessarily be reactive, it is the way many a site setup a monitoring system.
Minimum: make sure it is running :)
However, some other stuff would be very useful. For example, the CPU load, RAM usage and (in multiuser systems) which user is running what. Also, for applications that access network, a list of network connections for each app. And (if you have access to client computer(s)) it would be cool to be able to see the 'window title' of the app - maybe check each 2-3 minutes if it changed and save it. Also, a list of files open by the application could be very useful, but it is not a must.
I think this is fairly simple - monitor so that you can be warned early enough before something goes wrong. That means monitor dependencies and the application itself.
It's really hard to provide specifics if you're not going to give details on the application you're monitoring, so I'd say use that as a general rule.
At a minimum you want to know that the system is healthy. This is subjective in what defines your system is healthy. Is it computers are up, the needed resources exist, the data is flowing through the system, the data is properly producing results, etc, etc.
In my project we do monitoring of most of this and then some. It really comes down to what is the highest level that you can use to analyze that everything is working. In our case we need to know down to the data output. If you just need to know down to the are these machines up it saves you on trying to show an inexperienced end user what is wrong.
There are also "off the shelf" tools that will do a lot of the hard work for you if you are just looking too hard into data results. I particularly liked Nagios when I was looking around but we needed more than it could easily show so I wrote our own monitoring system. Basically we also watch for "peculiarities" in the system, memory / cpu spikes, etc...
thanks everyone for the input, i was really looking for application-level monitoring not infrastructure monitoring, but it is good to know about both
the difference is:
infrastructure monitoring would be servers plus MS Exchange Server, Apache, IIS, and so forth
application monitoring would be user machines and the specific programs that they use to do their jobs, and/or servers plus the data-moving/backend applications that they run to keep the data flowing
sometimes it's hard to draw the line - an oversimplified definition might be "if your team wrote it, it's an application; if you bought it, it's infrastructure"
i think in practice it is best to monitor both
What you need to do is to break down the business process of the application and then have the software emit events at major business components. In addition, you'll need to create end to end synthetic transactions (eg. emulating end users clicking on a website). All that data would be fed into an monitoring tool. In the past, I've done JMX for applications of which flowed into Tivoli Monitoring's JMX Adapter and then I've done scripts that implement a "fake user" and then pipe in the results into Tivoli Monitoring's Script Adapter. Tivoli Monitoring takes the data and then creates application health and performance charts from that raw data.

Resources