My team is responsible for a high traffic website, which is very dynamic with about 3.5 million unique urls. We deploy our application about 1 per week, we have a CMS that updates about 100 updates per week and our internal data source releases about 1 a week as well, and we consume about 10 other public webservices. It is always our teams responsibly to make sure everything is up and running.
We use pingdom to make sure some of are up, but it is limited to a few checks and it does not handle as many urls as we need.
We use Nagios as well but it is a bit of a black box and has not been fully adopted by our development team. Most of our developers are windows focused and cringe at the thought of all the configuration.
Most of what we need is just monitoring a few urls, and something that can notify me when things go down or change.
I think you should something like unit testing, that does internal testing of every website release before and after deoyment. Your application should have excellent exception handling as well and an exception should be logged and monitored.
If you use external monitoring with a tool like pingdom or www.downnotifier.com you can check one url per pagetime. For example: one news article, one text page and one product page.
Related
I have some questions that I am hoping someone out there will be able to answer for me.
Our situation is that we are considering a ground up replacement for an existing system. Firstly I will describe the existing system that we have.
We are currently operating on a pure object stack. The enviroment is OO and the database is OO. We currently have 3-4 million lines of code which was developed by 2-3 people, and we currently have a development team of 6, which continues to develop. The initial development started in 1997, and we have many clients installed. The environment is 64-bit, language and database, mulit-lingual, and is UNICODE. The operatiung system we use is Windows (latest versions). We have a number of modules which are delivered via a thin client (not browser), and the bandwidth usage is very low (Operates on 64KB WAN network performance level which is still prevalent in some countries in which we operate, i.e. the infrastructure is poor). Our biggest implementation is for one of the biggest companies in the world, and the target is to deliver the functionality for 30+ countries from one system instance (one physical db) for that client, and deliver the functionality using thin client to all countries from one set of application servers (the application servers are located with the db server and perform all of the processing), the thin client deals with the interactions with the users and the display of the data and collection of the data only. The system is used by 1000s of users, on the thin client. We also have mobile and portal components also, which are developed in C#, they are a small segment of the overall system and connect using APis. There are maybe 1000 mobile application users, with a final number expected to be 5000 mobile users. Within the system there will be 500000-1000000 vendors, with each vendor expected to have at least two transactions every single day.
The DB itself is partitioned, and replicated to a number of locations in real time. The final size of the DB when implementation is complete is expected to be in the 2TB range, and the current system will deal with that, no problem. The way the replication works is that there are mutiple replicated enviroments on hot-standy, i,e. all application servers and API servers are replicated. Our largest client routinely (once per month) performs scheduled windows updates, and when this occurs the primary environments are automatically rolled over to the secondaries, so the system remains available all of the time. In subsequent months, the system is rolled back to the primaries, this transition is very fast, i.e. real time.
At our largest client, the system was installed in 2014, and since that time it has not experienced any outage, except for planned outages because of server maintenace of whateveer in that time period, i.e. it has not crashed or faulted in the first three years of operation. For the purposes of providing updates and enhanced functionality to the target organisation or specifically one of their subsiduries in the countries in which they operate we are able to make changes to the system, via the loading of functional updates on-line. This is a very important component of my question, as for many years we have been able to update at one central location and have the new functionality immeadiately available to all users in all countries whilst they are continuosly using the application. This is without change to any .EXE or .DLL or whatever files that the end user is operating. This is a huge advantage for us currently, as many of the organisations we provide services to do NOT allow any change to EXE or DLL files on end user devices, and there is generally some approval process which takes some days and requires manual intervention by the users to make this process happen.
For further information, we have a support team of 6 providing support services to all of our clients in all of these countries, we operate three shifts of 2 people to provide these services. So this should give you some background to the stability of the system and the level of support we provide. Our service level is described as outstanding. We do have of course SLA agreements in place and we have not violated any SLA term ever.
So, now for my question. What technology would people choose to replace such a system, and how many people would it take to replace ? It has been recommended to me that C# and SQL server be used to replace this, and that it would take a couple of good people a year or two to re-develop from the ground up (we have all of the functional specifications from the last 20 years to work from). However, without having in depth knowledge of this technology stack I am quite concerned about the time period (I think it is very optimistic), I am concerned about the scaleability of the SQL server, and most importantly I am deeply concerned that we will loose this advantage that we have enjoyed that allows us to change the functionality of the current system via updates online without effecting logged on users. I am told that this sort of thing is just not possible in C# and if we have to provide an update to fix a bug, or provide new functionality then all users will have to replace the effected EXE and DLL files, i.e. all of them, 1000s of users would have to do this each and every time we update. This would be done automatically via a process called OneClick, but I am assuming if there is a company policy within our client environment that EXE changes are not allowed, then OneClick will not be viable. I am told if we took a browser approach to the new development then any updates would be server side (which is better), but, would still require an outage to apply updates.
Finally, more information on the online updates that are now possible. Currently all of the systems are replicated for disaster recovery and 100% uptime during update purposes. When we currently update our systems (at one central location) those logical updates are automatically applied at all replicated systems also without user intervention. Another concern that I have is that as well as the problem we face with updating multiple locations with the same update, which it seems is a requirement in C# or so I am being told, we will also have all of the replicated systems to update manually as well. As you can see our support team is small, so I am worried about a future blowout in maintenance resources required to maintain all of this, and then the cost in terms of times fixing mistakes that may creep in with all these additional tasks that may be required to perform the same exercise that we currently do only once.
Finally, a final peice of information on how we currently do updates. If the update is structural in nature, i.e. changes the physical structure of the database, then an outage is required, a full system down outage. When we apply the update the structural change is made, and this is automatically replicated across all secondary (standby enviroments). The users are not effected in terms of the software for the thin client or browsers. They simply log back on after the outage is complete. We currently have a window at a set time, once per month to perform these updates, however, it is rarely required. Once per week, we have a window for functional changes to be applied, and these are appled on line whilst the users are all on line performing their daily and periodic tasks.
So, if anyone out there can give me some insight into what technologies are available for such a system replacement or whether C# and SQL server can provide the necessary services and performance we actually need, i.e. I would be particularly interested to know whether in fact C# applications can be updated in real time, then that would be fantastic. We are obviously in the very early stages of this process in terms of how this should be done, so any information you can provide would be greatly appreciated and will save many hours of research.
Thank you in advance.
From the basic requirements you describe, my first thought is that you should probably adopt a full Web-based solution for your system, that way all updates can be done centrally without too much negative effect on your client access.
But if I understand correctly your question, one aspect you're requiring is to have executable code ready at the client-side (so a pure Web-solution won't work).
In that case, something that can quickly & easily update at the client side is needed.
We've been using the node.js and MongoDB stack for a few years now, there are some quite interesting effects of using pure scripts for your business logic: besides being easy to develop, the scripts themselves, when designed with certain guidelines, can perform "hot reload" on the fly to update your business logic. So this is what I'd recommend trying / looking at.
Efficiency of node.js and the flexibility provided by NoSQL DB such as MongoDB is well described in many places if you do a simple Google search.
Currently after a build/deployment of our app (58 projects, large asp.net MVC 3 front end) takes ~15-20secs to load as it goes through the whole 'recycling the app pool' (release configuration).
We do have a web farm if that alters people's answers, but the question really is:
What are people doing in large scale applications where a maintenance window isn't viable (we're a 24/7 very active website) to minimize that initial 'first hit' on the app pool recycle after a deploy?
We've used a number of tools to analyze that startup time and there doesn't really seem to be any way to bring it down so what I'm looking for are what techniques do people employ in order to minimize the impact of a large application deploy affecting users.
By default - if you change 15 files in an ASP.NET application at once (even via FTP) then the app pool is automatically recycled. You can change the number of files but as soon as web.config and bin files are changed then it needs to recycle. So in my opinion the ideal solution for an environment like yours would be as follows:
4 web servers (this is an arbitrary number)
each server has a status.aspx that the load balancer looks at - use TeamCity to take 2 of these servers "off line" (off the load balancer) and wait 20 seconds for the traffic to filter across. A distributed cache will help keep user experience problems
Use TeamCity to deploy to those 2 servers - run your automated tests etc. and once you are happy put those back into the farm and take the other 2 offline and deploy to those
This can all be scripted / automated. The only issue with this is any schema changes that are not backwards compatible may not allow running the new version site in parallel with old version of the site for the 20 seconds for the load balancer to kick back in
This is good old fashioned Canary Releasing - there are some patterns here http://continuousdelivery.com/patterns/ to help take into consideration. Id also suggest a copy of that continuous delivery book - its like a continuous delivery bible and has got me out of a few situations :)
At the very base you could run a tinyget script against the application after completion of deployment which will "warm up" the application however if a customer hits your site before the script can run, they will still face a delay. What do you currently have in place, what post deployment steps do you have in place?
In a farm environment you could stage deployments too, so take one server out of load balance, update it and then bring that online after deployment and take the other out, complete the deployment and then reintroduce into the farm. How is your SQL Server setup - clustered?
copy and paste from my post here
We operate a Blue/Green deployment strategy on a 4 tier architecture which has a web site over 4 servers at the top tier. Due to the complexity the architecture introduced for deployments, we needed a way to deploy without disturbing any traffic to the "live" site. Following Fowler's advice, but not quite in the same way, we came up with a solution that means we have 2 sites on each server (a blue and a green, or in our case site A and site B). The live site has the appropriate host header, and once we have deployed and tested to the non-live site, we then flip the headers of the 2 sites so that what was once live is now the non-live site, and vice-versa. The effect is, a robust deployment that can be done in business hours and with the highest level of confidence.
This of course complicates your configuration and deployment slightly, but it's worth the effort. I guess it kind of goes without saying that you want to script both the deployment, and the host header swapping.
Firstly, unless you're running Google or something bigger, does a 15-20s load time at 3am for a handful of users really impact that much? I'd say the effort invested in eliminating the occasional lag would far outweigh the 15-20s inconvenience of a couple of users.
I consider it a necessary evil of using ASP.NET unfortunately. Using a pre-compiled site (.DLLs instead of the code-behind files) will lessen the time but not necessarily eliminate it.
The best thing you can do is use something like a status notification bar to warn users they may experience some "issues" during "essential maintenance".
But even then, I'd say in terms of user experience it'd be better to keep quiet and have a handful of people blame their "slow internet" when your site takes 20s to load on one occasion, than announce to all and sundry that it will be slow.
You can also try this approach : http://weblogs.asp.net/scottgu/archive/2009/09/15/auto-start-asp-net-applications-vs-2010-and-net-4-0-series.aspx
without knowing anything about your site, my first thought is that you might be able to break it down into smaller sites so that they start faster individually.
second, with your web farm, i assume you have some sort of load balancing device in front of that from which you can pull machines out of the pool when they are being deployed. don't put them back in the pool until after you have sent a request against the site to get it started up. you should be able to script this such that you are pretty much clicking a button that takes a machine out, deploys to it, and sends a request after it's back up and happy.
You can consider using aspnet_compiler.exe to precompile your application, because I think the delay after deployment is caused by the compilation phase rather than "whole recycling the app pool".
I'm writing a Rails web service that interacts with various pieces of hardware scattered throughout the country.
When a call is made to the web service, the Rails app then attempts to contact the appropriate piece of hardware, get the needed information, and reply to the web client. The time between the client's call and the reply may be up to 10 seconds, depending upon lots of factors.
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results).
I basically see two options. Either run JRuby and use multithreading or else run several regular Ruby instances and hope that not many people try to use the service at a time. JRuby seems like the much better solution, but it still doesn't seem to be mainstream and have out of the box support at Heroku and EngineYard. The multiple instance solution seems like a total kludge.
1) Am I right about my two options? Is there a better one I'm missing?
2) Is there an easy deployment option for JRuby?
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results).
From an engineering perspective, this seems like it would be the best alternative.
Why don't you want to do it?
There's a third option: If you host your Rails app with Passenger and enable global queueing, you can do this transparently. I have some actions that take several minutes, with no issues (caveat: some browsers may time out, but that may not be a concern for you).
If you're worried about browser timeout, or you cannot control the deployment environment, you may want to process it in the background:
User requests data
You enter request into a queue
Your web service returns a "ticket" identifier to check the progress
A background process processes the jobs in the queue
The user polls back, referencing the "ticket" id
As far as hosting in JRuby, I've deployed a couple of small internal applications using the glassfish gem, but I'm not sure how much I would trust it for customer-facing apps. Just make sure you run config.threadsafe! in production.rb. I've heard good things about Trinidad, too.
You can also run the web service call in a delayed background job so that it's not hogging up a web-server and can even be run on a separate physical box. This is also a much more scaleable approach. If you make the web call using AJAX then you can ping the server every second or two to see if your results are ready, that way your client is not held in limbo while the results are being calculated and the request does not time out.
I work in an IT department that is divided into two groups. One group develops and manages applications, the other manages the company's infrastructure and servers. One of the problems we face is a break down in communication. I work for the application group and one of the problems I have is not being notified when a server is taken down by infrastructure, or a database is being refreshed.
Does anyone have suggestions on how to improve communications between the two groups or any ideas on how to keep a light-weight log across multiple systems (both linux and windows)? Ideally it would be nice if we could have our boxes just tweet their statuses or something.
Thanks for the help,
Ben
One thing you could do to communicate server status is to have our Infrastructure group setup a network monitoring system like Nagios. This will give everyone in your application group the ability to get a snapshot view of the status of every server in the system. Having this kind of status is invaluable when you are doing development.
Nagios gives you network monitoring, but also allows you to show scheduled down time for a particular server in the system.
Another thing your group could do to foster communication with the Infrastructure is to have your build system report which servers it is currently using for building and testing your products.
Also, setting up regular meeting between stakeholders of both groups is probably a good idea too. If you all are talking to each other, even for 15 minute a week, you'll probably see incidents like the one you described above go down quite a bit.
I think this is a bigger issue of change control.
You should have hardware and software change control and an approval process.
Ultimately, infrastructure serves you - the purpose for IT infrastructure is to run applications.
In my current large financial data company, servers are not TOUCHED without proper authorization through the client and application groups. It seems like a huge pain, but every single server is there for a reason - to meet a specific business goal and run a specific application. There is simply no excuse for the infrastructure group to be changing things or upsetting servers on their own volition.
Response to critical hardware failure might be an exception.
Needed software and OS updates are handled through scheduled maintenance windows and an approved change process.
I like the Nagios idea as well. If you want to setup something that's more of a communication tool, I would recommend a content management system like Drupal.
We use Drupal internally to communicate between teams. When one team takes a server down, they would add an event into Drupal. The rest of us would either get it as an email, an RSS item or just by refreshing the page.
Implement a change control process where changes are submitted, approved and scheduled for BOTH groups. This lets everyone know what is going on. This process can be as light or heavy-weight as you want.
We currently use Hp SiteScope for monitoring synthetic transactions across some of our web apps. This works pretty well except for the licensing cost for each synthetic transaction makes it prohibitive to ensure adequate coverage across our applications.
So, an alternative would be to use SiteScope's URL monitoring which can basically call a URL and then provide some basic checks for the certain strings. With that approach, I'd like to create a page that either calls a bunch of pages or try to tap into a MSTest group somehow to run tests.
In the end, I'd like a set of test cases that can be used against multiple environments to be used for production verification, uptime, status, etc.
Thanks,
Matt
Have you taken a look at System Center Operations Manager 2007?
I'm just getting started, but it appears to do what you are describing in your question.
We are looking to monitoring our data center and the a web application...from the few things I have found on the web it is going to fit our need.
Update
I've since moved to Application Insights. A great overview can be found here, https://azure.microsoft.com/en-us/documentation/articles/app-insights-monitor-web-app-availability/
There are two methods one can use, a simple ping, or record a multi-step synthetic user "experience". Basically you act as a user, and using IE and a Visual Studio Web Test project you record navigating around your site and upload that file to Azure.
For example, I record logging in, navigating a few pages, and then logging out. As long as all of those events happen in a timely manner the site is in a good operating state.
If the tests fail, take too long to respond for example, I'll get an email alerting me something isn't exactly right.