Tool for monitoring QoS

In my project we crawl x number of servers.
The number of users on each server varies from 1 to n.
We crawl 1 to z items for each user.
Currently we monitor QoS using Graphite, storing the time taken to crawl each item:
x.time_taken
The problem with this approach is that if only a single user is affected we get a false alert about QoS.
What would be the correct tool/technique to monitor the following points:
Alert only if a minimum of k users are affected (not the number of events).
Provide the list of users that were affected.
I think Graphite and StatsD are not the right tools for this. What would be a better tool for answering those two questions?

What you are asking for is often called Service Monitoring. For very good reasons you want to know the service impact of an event, rather than just that an event has happened.
The advantage of this approach is exactly as you state in your requirements - you can focus on events which impact a large part of your user base and you have a list of the users affected right away.
The main drawback, IMHO, is that Service Monitoring is usually much more complex than simple performance or event/alert monitoring. It also often relies on a service model, which in my experience is something that is hard to build and even harder to keep up to date.
For example if a server in your system shows a significant slow down or failure, depending on your architecture this may impact all users who use a service that relies on that server, or it may impact a very small subset, or even none at all initially, if there is a load balancing mechanism or redundancy mechanism in place.
You would need to reflect this architecture in your service monitoring model, and also change it every time you update your system architecture or deployment.
If your system is static enough or critical enough to warrant the investment, then this may be worth your while. If not, a simple compromise may be to update the graphing and alerting you already have so that it alerts when the average response time over a set number of users, or over all users on a server, increases by a significant amount.
This may give you most of the benefits you are after without having to invest in the extra complexity of a service monitoring solution.
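If you do end up rolling your own check on top of per-user timings, the aggregation logic itself is small. Here is a minimal sketch in Python, assuming each crawl timing can be tagged with a user id; the threshold values and names are illustrative only:

    # Minimal sketch of "alert only if at least k users are affected",
    # assuming each crawl timing event can be tagged with a user id.
    # K_USERS and SLOW_THRESHOLD are illustrative values.
    from collections import defaultdict

    K_USERS = 5            # minimum number of affected users before alerting
    SLOW_THRESHOLD = 30.0  # seconds above which a crawl counts as "slow"

    def check_qos(events):
        # events: iterable of (user_id, time_taken_seconds)
        slow_counts = defaultdict(int)
        for user_id, time_taken in events:
            if time_taken > SLOW_THRESHOLD:
                slow_counts[user_id] += 1
        affected_users = sorted(slow_counts)
        should_alert = len(affected_users) >= K_USERS
        return should_alert, affected_users

    # Example:
    # alert, users = check_qos([("u1", 45.0), ("u2", 12.0), ("u3", 31.0)])
    # alert is False here because only two users crossed the threshold.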
If you definitely are looking to expand your monitoring approach and want to stick with open-source tools, then I would start by looking at Nagios if your focus is on infrastructure; there are also quite a few web service monitoring solutions with free tiers, such as Pingdom:
http://www.nagios.org
https://www.pingdom.com

Related

System replacement technology

I have some questions that I am hoping someone out there will be able to answer for me.
Our situation is that we are considering a ground up replacement for an existing system. Firstly I will describe the existing system that we have.
We are currently operating on a pure object stack. The environment is OO and the database is OO. We currently have 3-4 million lines of code which was developed by 2-3 people, and we currently have a development team of 6, which continues to develop. The initial development started in 1997, and we have many clients installed. The environment is 64-bit, language and database, multi-lingual, and is Unicode. The operating system we use is Windows (latest versions). We have a number of modules which are delivered via a thin client (not browser), and the bandwidth usage is very low (it operates at a 64KB WAN network performance level, which is still prevalent in some countries in which we operate, i.e. the infrastructure is poor). Our biggest implementation is for one of the biggest companies in the world, and the target is to deliver the functionality for 30+ countries from one system instance (one physical DB) for that client, and deliver the functionality using the thin client to all countries from one set of application servers (the application servers are located with the DB server and perform all of the processing); the thin client deals only with the interactions with the users, the display of the data and the collection of the data. The system is used by thousands of users on the thin client. We also have mobile and portal components, which are developed in C#; they are a small segment of the overall system and connect using APIs. There are maybe 1,000 mobile application users, with a final number expected to be 5,000 mobile users. Within the system there will be 500,000-1,000,000 vendors, with each vendor expected to have at least two transactions every single day.
The DB itself is partitioned, and replicated to a number of locations in real time. The final size of the DB when implementation is complete is expected to be in the 2TB range, and the current system will deal with that, no problem. The way the replication works is that there are multiple replicated environments on hot standby, i.e. all application servers and API servers are replicated. Our largest client routinely (once per month) performs scheduled Windows updates, and when this occurs the primary environments are automatically rolled over to the secondaries, so the system remains available all of the time. In subsequent months, the system is rolled back to the primaries; this transition is very fast, i.e. real time.
At our largest client, the system was installed in 2014, and since that time it has not experienced any outage, except for planned outages because of server maintenance or the like, i.e. it has not crashed or faulted in the first three years of operation. For the purposes of providing updates and enhanced functionality to the target organisation, or specifically one of their subsidiaries in the countries in which they operate, we are able to make changes to the system via the loading of functional updates online. This is a very important component of my question, as for many years we have been able to update at one central location and have the new functionality immediately available to all users in all countries whilst they are continuously using the application. This is without change to any .EXE or .DLL or similar files that the end user is operating. This is a huge advantage for us currently, as many of the organisations we provide services to do NOT allow any change to EXE or DLL files on end user devices, and there is generally some approval process which takes some days and requires manual intervention by the users to make this process happen.
For further information, we have a support team of 6 providing support services to all of our clients in all of these countries; we operate three shifts of 2 people to provide these services. So this should give you some background to the stability of the system and the level of support we provide. Our service level is described as outstanding. We do of course have SLAs in place, and we have never violated any SLA term.
So, now for my question. What technology would people choose to replace such a system, and how many people would it take to replace? It has been recommended to me that C# and SQL Server be used to replace this, and that it would take a couple of good people a year or two to re-develop from the ground up (we have all of the functional specifications from the last 20 years to work from). However, without having in-depth knowledge of this technology stack I am quite concerned about the time period (I think it is very optimistic), I am concerned about the scalability of SQL Server, and most importantly I am deeply concerned that we will lose the advantage we have enjoyed that allows us to change the functionality of the current system via online updates without affecting logged-on users. I am told that this sort of thing is just not possible in C#, and that if we have to provide an update to fix a bug or provide new functionality then all users will have to replace the affected EXE and DLL files, i.e. all of them; thousands of users would have to do this each and every time we update. This would be done automatically via a process called OneClick, but I am assuming that if there is a company policy within our client environment that EXE changes are not allowed, then OneClick will not be viable. I am told that if we took a browser approach to the new development then any updates would be server side (which is better), but would still require an outage to apply updates.
Finally, more information on the online updates that are now possible. Currently all of the systems are replicated for disaster recovery and 100% uptime during updates. When we currently update our systems (at one central location) those logical updates are automatically applied at all replicated systems as well, without user intervention. Another concern that I have is that as well as the problem we face with updating multiple locations with the same update, which it seems is a requirement in C# (or so I am being told), we will also have all of the replicated systems to update manually as well. As you can see our support team is small, so I am worried about a future blowout in the maintenance resources required to maintain all of this, and then the cost in terms of time spent fixing mistakes that may creep in with all the additional tasks that may be required to perform the same exercise that we currently do only once.
Finally, a last piece of information on how we currently do updates. If the update is structural in nature, i.e. it changes the physical structure of the database, then an outage is required, a full system-down outage. When we apply the update the structural change is made, and this is automatically replicated across all secondary (standby) environments. The users are not affected in terms of the software for the thin client or browsers; they simply log back on after the outage is complete. We currently have a window at a set time, once per month, to perform these updates; however, it is rarely required. Once per week, we have a window for functional changes to be applied, and these are applied online whilst the users are all online performing their daily and periodic tasks.
So, if anyone out there can give me some insight into what technologies are available for such a system replacement, or whether C# and SQL Server can provide the necessary services and performance we actually need (I would be particularly interested to know whether C# applications can in fact be updated in real time), that would be fantastic. We are obviously in the very early stages of this process in terms of how this should be done, so any information you can provide would be greatly appreciated and will save many hours of research.
Thank you in advance.
From the basic requirements you describe, my first thought is that you should probably adopt a fully web-based solution for your system; that way all updates can be done centrally without too much negative effect on your client access.
But if I understand your question correctly, one aspect you require is to have executable code on the client side (so a pure web solution won't work).
In that case, something that can be quickly and easily updated on the client side is needed.
We've been using the Node.js and MongoDB stack for a few years now, and there are some quite interesting effects of using pure scripts for your business logic: besides being easy to develop, the scripts themselves, when designed with certain guidelines, can perform a "hot reload" on the fly to update your business logic. So this is what I'd recommend trying / looking at.
The efficiency of Node.js and the flexibility provided by a NoSQL DB such as MongoDB are well described in many places if you do a simple Google search.
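To make the "hot reload" idea concrete, here is a minimal sketch in Python rather than Node.js, purely to show the pattern; the business_rules module and its process() function are hypothetical stand-ins for your scripted business logic:

    # Illustration of the "hot reload" idea described above, sketched in
    # Python. The business_rules module and process() are hypothetical.
    import importlib
    import os
    import time

    import business_rules  # hypothetical module holding the business logic

    def handle_request(payload):
        # Calls go through the module reference, so a reload is picked up
        # without restarting the process or touching any client software.
        return business_rules.process(payload)

    def watch_and_reload(path="business_rules.py", interval=5):
        # Poll the rules file and re-import it when it changes on disk.
        last_mtime = os.path.getmtime(path)
        while True:
            time.sleep(interval)
            mtime = os.path.getmtime(path)
            if mtime != last_mtime:
                importlib.reload(business_rules)
                last_mtime = mtime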

Erlang OTP based application - architecture ideas

I'm trying to write an Erlang application (OTP) that would parse a list of users and then launch workers that will work 24x7 to collect user data (using three different APIs) from remote servers and store it in ETS.
What would be the ideal architecture for this kind of application? Do I launch a bunch of workers, one for each user (assuming a small number of users)? What will happen if the number of users increases very rapidly?
Also, to call the different APIs I need to put a timer mechanism in the worker process.
Any hint will be really appreciated.
Spawning a new process for each user is not such a bad idea. There are HTTP servers that do this for each connection, and they are doing quite fine.
First of all, the cost of creating a new process is minimal, and the cost of maintaining processes is even smaller. If one of them has nothing to do, it won't do anything; there is (almost) no runtime overhead from inactive processes, which in the end means that you are doing only the work you have to do (this is in fact the source of the reactivity of Erlang systems).
One issue might be memory usage. Each process has its own memory stack, and in a use case where they do not actually need to store any internal data, you might be allocating some unnecessary memory. But this too can be tuned (even at runtime), and in most cases such memory will be garbage collected.
Actually, I would not worry about such things too soon. The issues you might encounter depend on many things, mostly the amount of outside data or user activity, and you cannot really design for this up front. Most probably you won't encounter any of them for quite some time. There's no need for premature optimization, especially if it could bind you to a design that would slow down the rest of your development process. In Erlang, with processes being the main source of abstraction, you can easily swap this process-per-user design for a pool of workers, and ETS for an external service, but only if you really need it.
What's most important is the fact that representing a "user" as a process is closest to the problem domain. "Users" are independent entities and deserve separate processes (they have their own state, and they can act or react independently of each other). It is quite similar to using objects and classes in other languages (that is an over-simplification, but it should get you going).
If you were writing this in Python or C++ would you worry about how many objects you were creating? Only in extreme cases. In Erlang the same general rule applies for processes. Don't worry about how many you are creating.
As for architecture, the only element that is an architectural issue in your question is whether you should design a fixed worker pool or a 1-for-1 worker pool. The shape of the supervision tree would be an outcome of whichever way you choose.
If you are scraping data, your real bottleneck isn't going to be how many processes you have; it will be how many network requests per second you are able to make to each API you are trying to access. You will almost certainly get throttled.
(A few months ago I wrote a test demonstration of a very similar system to the one you are describing. The limiting factor was the API request limits from providers like Facebook, YouTube, Google+ and Yahoo, not the number of processes.)
As always with Erlang, write some system first, and then benchmark it for real before worrying about performance. You will usually find that performance isn't an issue, and when it is you will discover that it is much easier to optimize one small part of an existing system than to design an optimized system from scratch. So just go for it: write something that basically does what you want right now, and worry about optimization tweaks afterwards. Once you have some concrete performance data (memory, request latency, etc.), that is the time to start thinking about performance.
Your problem will almost certainly be on the API providers' side or your network latency, not congestion within the Erlang VM.
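Whichever way you structure the processes, the part worth sketching is how the workers respect per-provider request quotas. Here is a minimal token-bucket limiter, written in Python rather than Erlang purely to illustrate the idea; the API names and rates are made up:

    # Illustrative token-bucket rate limiter; per-provider quotas are made up.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec, burst):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = burst
            self.last = time.monotonic()

        def acquire(self):
            # Block until one request token is available.
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                time.sleep((1 - self.tokens) / self.rate)

    # One bucket per external API, shared by all per-user workers.
    api_limits = {"api_a": TokenBucket(5, 10), "api_b": TokenBucket(1, 2)}

    def fetch(api_name):
        api_limits[api_name].acquire()
        # ... make the actual HTTP request here ...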

Identifying poor performance in an Application

We are in the process of building a high-performance web application.
Unfortunately, there are times when performance unexpectedly degrades and we want to be able to monitor this so that we can proactively fix the problem when it occurs, as opposed to waiting for a user to report the problem.
So far, we are putting in place system monitors for metrics such as server memory usage, CPU usage and for gathering statistics on the database.
Whilst these show the overall health of the system, they don't help us when one particular user's session is slow. We have implemented tracing in our C# application, which is particularly useful when identifying issues where data is the culprit, but for performance reasons tracing will be off by default and only enabled when trying to fix a problem.
So my question is: are there any other best practices that we should be considering (WMI, for instance)? Is there anything else we should consider building into our web app that will benefit us without itself becoming a performance burden?
This depends a lot on your application, but I would always suggest adding your application's own metrics into your monitoring, for example the number of recent picture uploads or the number of concurrent users; I think you get the idea. Seeing the application-specific metrics in combination with your server metrics like memory or CPU sometimes gives valuable insights.
In addition to system health monitoring (using Nagios) of parameters such as load, disk space, etc., we:
have built in a REST service, called from Nagios, that provides statistics on
transactions per second (which makes sense in our case)
number of active sessions
the number of errors in the logs per minute
....
in short, anything that is specific to the application(s)
monitor the time it takes for a (dummy) round-trip transaction, as if a user or system were performing the business function
With all this data being sent back to Nagios, we then configure alert levels and notifications.
We find that monitoring the number of error entries in the logs gives an excellent short-term warning of a major crash or issue on the way, for a lot of systems.
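For illustration, a minimal sketch of the kind of application-metrics REST endpoint described above, assuming a Python/Flask service; the path, metric names and data sources are placeholders only:

    # Hypothetical application-metrics endpoint that a Nagios check could
    # poll; metric names and data sources are placeholders.
    from flask import Flask, jsonify

    app = Flask(__name__)

    def count_recent_log_errors():
        # Placeholder: a real implementation would scan the last minute of
        # application logs for ERROR entries.
        return 0

    @app.route("/metrics")
    def metrics():
        return jsonify({
            "transactions_per_second": 0.0,   # fill from application counters
            "active_sessions": 0,             # fill from the session store
            "log_errors_last_minute": count_recent_log_errors(),
        })

    if __name__ == "__main__":
        app.run(port=8080)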
Many of our customers use Systems and Application Monitor, which handles the health monitoring, along with Synthetic End User Monitor, which runs continuous synthetic transactions to show you the performance of a web application from the end-user's perspective. It works for apps outside and behind the firewall. Users often tell us that SEUM will reveal availability problems from certain locations, or at certain times of day. You can download a free trial at SolarWinds.com.

NServiceBus appropriate for load distribution of periodic tasks

Would NServiceBus or an equivalent ESB be appropriate for an application that has a bunch of different kinds of background maintenance-type tasks? For example:
Scanning databases for the occurrence of certain words in user-generated content
Updating database tables that store the results of relatively expensive queries
Creating/maintaining external indexes for content
Sending event notification emails for a scheduled event.
My idea is to employ some kind of task scheduler (either the Windows built-in one, Quartz.NET, or my own database-based solution) to publish different kinds of messages onto the bus periodically. The period may be as short as one minute or as long as a day. The reason I want to use the bus is so that I can scale out the number of subscribers as the system becomes larger and busier and the tasks become either more frequent or more resource-intensive. It would also provide redundancy as long as I always have at least two subscribers running.
The obvious alternative to this would be to write my own Windows Service that is triggered by the scheduler and performs the work, but I feel that making that scale beyond a single machine and providing fault tolerance might be more difficult than using the ESB as that plumbing.
Does this sound like a reasonable approach? Alternative suggestions?
TIA
As the author of NServiceBus, I'm quite probably biased, but there is a tradeoff between learning a new technology and writing (possibly a simpler version of) your own. I would recommend weighing the longer-term maintenance (and documentation) costs of a solution written in house against those of an existing tool.
In terms of the feature-set you described, NServiceBus does provide facilities for all of that.
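Not NServiceBus-specific code, but here is a minimal in-process Python sketch of the shape being discussed: a scheduler publishing task messages and competing consumers picking them up. In the real design the standard-library queue would be replaced by a durable bus so subscribers can run on separate machines; the task types and timings are illustrative:

    # In-process sketch of scheduler + competing consumers; in the real
    # system the queue would be a durable bus shared across machines.
    import queue
    import threading
    import time

    tasks = queue.Queue()

    def scheduler():
        # Periodically publish task messages; task types are illustrative.
        while True:
            tasks.put({"type": "scan_content"})
            tasks.put({"type": "refresh_external_index"})
            time.sleep(60)

    def worker(worker_id):
        # Competing consumer: whichever worker is free takes the next task.
        while True:
            task = tasks.get()
            print(f"worker {worker_id} handling {task['type']}")
            tasks.task_done()

    threading.Thread(target=scheduler, daemon=True).start()
    for i in range(2):  # "at least two subscribers running" for redundancy
        threading.Thread(target=worker, args=(i,), daemon=True).start()

    time.sleep(180)  # keep the demo alive for a few scheduler ticks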

Rails hosting advice - EngineYard, Heroku, EC2

I'm developing a very sensitive application for a client that needs to have 99.9999999999% uptime guarantee.
It's a Rails application with a MySQL database. I am thinking of hosting it on EngineYard due to the low maintenance requirements and ease of running it.
Heroku does not seem to be the perfect solution due to uptime problems.
EC2 can also be a good solution but maybe it requires too much work to install and maintain.
My question is: how do I make a redundant system using EngineYard, Heroku, EC2 or any other Rails hosting that you propose? Do I need to have 2 instances in different parts of the world being replicated? Please advise the best way.
Regards.
Everyone wants 100% uptime, but achieving it is pretty much impossible. Since downtime can be caused by any of the links in the chain, and there are usually dozens, to achieve such a high standard you will need to buy gold-plated everything; essentially, you'll have to spend a fortune. The difference between 99% uptime, which means your site is unavailable for around 88 hours a year, and 99.9% uptime, where it's less than ten hours, is considerable, and from there to 99.99% the bar is even higher, with a tolerance of under an hour for a full year.
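Those downtime budgets are just arithmetic; a quick sketch of where the figures come from:

    # Downtime budget per year for a given availability level.
    HOURS_PER_YEAR = 365 * 24  # 8760

    for availability in (0.99, 0.999, 0.9999):
        downtime_hours = HOURS_PER_YEAR * (1 - availability)
        print(f"{availability:.2%} uptime -> "
              f"about {downtime_hours:.1f} hours of downtime per year")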
Going beyond 99.99% is simply impractical. Nobody will sign a guarantee like this unless they're being dishonest, the agreement is so loaded down with caveats as to be unenforceable, or they don't mind dishing out heavy credits all the time. Amazon EC2's SLA is 99.99%, for instance.
The metrics I've seen collected on a provider like Linode show uptimes of about 99.97% to 99.99%. Occasionally you will see datacenters claiming 100% uptime, but this is at the network level only and doesn't take into account intermittent internal glitches that may knock your server offline.
Choosing a managed hosting provider like Engine Yard might be the solution for you, because it can minimize your exposure to random events, but it won't get you such a high uptime in and of itself. They're very good at maintaining the system layer, but their ability to fix or work around bugs in your application is very limited, and they are subject to the same intermittent networking issues on EC2 as anyone else.
There are two kinds of reliability you should be concerning yourself with. One is availability, which is purely a measure of how likely a client is to be able to use the application. The other is data integrity, which is a measure of how likely data is to be retained given any number of disaster scenarios.
Most people will accept that an application might be down every so often for brief periods of time, but people refuse to accept that data may go missing every now and then.
It isn't hard to get a "99.9999999999%" data retention rate, but you will need to plan out your backup, replication, and recovery strategy in detail and will have to exercise your systems regularly to ensure they are working as designed.
While you have almost no control over the often patchy routing on the internet in general, the defect rate in the hardware of your server, the power in your data center and so on, you do have a huge amount of control over your backup strategy.
EY uses a company called Terremark for their hosting, which is some pretty serious hosting infrastructure. Out of the three you listed, I would go with them.
For uptime, you want to look at master/slave replication of your data, automatic failover, and building in redundancy wherever you can. High availability is a fairly involved topic and has more to do with IT than dev; I would recommend asking where to start over at serverfault.com.
