Keeping applications and infrastructure connected - communication

I work in an IT department that is divided into two groups. One group develops and manages applications, the other manages the company's infrastructure and servers. One of the problems we face is a break down in communication. I work for the application group and one of the problems I have is not being notified when a server is taken down by infrastructure, or a database is being refreshed.
Does anyone have suggestions on how to improve communications between the two groups or any ideas on how to keep a light-weight log across multiple systems (both linux and windows)? Ideally it would be nice if we could have our boxes just tweet their statuses or something.
Thanks for the help,
Ben

One thing you could do to communicate server status is to have our Infrastructure group setup a network monitoring system like Nagios. This will give everyone in your application group the ability to get a snapshot view of the status of every server in the system. Having this kind of status is invaluable when you are doing development.
Nagios gives you network monitoring, but also allows you to show scheduled down time for a particular server in the system.
Another thing your group could do to foster communication with the Infrastructure is to have your build system report which servers it is currently using for building and testing your products.
Also, setting up regular meeting between stakeholders of both groups is probably a good idea too. If you all are talking to each other, even for 15 minute a week, you'll probably see incidents like the one you described above go down quite a bit.

I think this is a bigger issue of change control.
You should have hardware and software change control and an approval process.
Ultimately, infrastructure serves you - the purpose for IT infrastructure is to run applications.
In my current large financial data company, servers are not TOUCHED without proper authorization through the client and application groups. It seems like a huge pain, but every single server is there for a reason - to meet a specific business goal and run a specific application. There is simply no excuse for the infrastructure group to be changing things or upsetting servers on their own volition.
Response to critical hardware failure might be an exception.
Needed software and OS updates are handled through scheduled maintenance windows and an approved change process.

I like the Nagios idea as well. If you want to setup something that's more of a communication tool, I would recommend a content management system like Drupal.
We use Drupal internally to communicate between teams. When one team takes a server down, they would add an event into Drupal. The rest of us would either get it as an email, an RSS item or just by refreshing the page.

Implement a change control process where changes are submitted, approved and scheduled for BOTH groups. This lets everyone know what is going on. This process can be as light or heavy-weight as you want.

Related

System replacement technology

I have some questions that I am hoping someone out there will be able to answer for me.
Our situation is that we are considering a ground up replacement for an existing system. Firstly I will describe the existing system that we have.
We are currently operating on a pure object stack. The enviroment is OO and the database is OO. We currently have 3-4 million lines of code which was developed by 2-3 people, and we currently have a development team of 6, which continues to develop. The initial development started in 1997, and we have many clients installed. The environment is 64-bit, language and database, mulit-lingual, and is UNICODE. The operatiung system we use is Windows (latest versions). We have a number of modules which are delivered via a thin client (not browser), and the bandwidth usage is very low (Operates on 64KB WAN network performance level which is still prevalent in some countries in which we operate, i.e. the infrastructure is poor). Our biggest implementation is for one of the biggest companies in the world, and the target is to deliver the functionality for 30+ countries from one system instance (one physical db) for that client, and deliver the functionality using thin client to all countries from one set of application servers (the application servers are located with the db server and perform all of the processing), the thin client deals with the interactions with the users and the display of the data and collection of the data only. The system is used by 1000s of users, on the thin client. We also have mobile and portal components also, which are developed in C#, they are a small segment of the overall system and connect using APis. There are maybe 1000 mobile application users, with a final number expected to be 5000 mobile users. Within the system there will be 500000-1000000 vendors, with each vendor expected to have at least two transactions every single day.
The DB itself is partitioned, and replicated to a number of locations in real time. The final size of the DB when implementation is complete is expected to be in the 2TB range, and the current system will deal with that, no problem. The way the replication works is that there are mutiple replicated enviroments on hot-standy, i,e. all application servers and API servers are replicated. Our largest client routinely (once per month) performs scheduled windows updates, and when this occurs the primary environments are automatically rolled over to the secondaries, so the system remains available all of the time. In subsequent months, the system is rolled back to the primaries, this transition is very fast, i.e. real time.
At our largest client, the system was installed in 2014, and since that time it has not experienced any outage, except for planned outages because of server maintenace of whateveer in that time period, i.e. it has not crashed or faulted in the first three years of operation. For the purposes of providing updates and enhanced functionality to the target organisation or specifically one of their subsiduries in the countries in which they operate we are able to make changes to the system, via the loading of functional updates on-line. This is a very important component of my question, as for many years we have been able to update at one central location and have the new functionality immeadiately available to all users in all countries whilst they are continuosly using the application. This is without change to any .EXE or .DLL or whatever files that the end user is operating. This is a huge advantage for us currently, as many of the organisations we provide services to do NOT allow any change to EXE or DLL files on end user devices, and there is generally some approval process which takes some days and requires manual intervention by the users to make this process happen.
For further information, we have a support team of 6 providing support services to all of our clients in all of these countries, we operate three shifts of 2 people to provide these services. So this should give you some background to the stability of the system and the level of support we provide. Our service level is described as outstanding. We do have of course SLA agreements in place and we have not violated any SLA term ever.
So, now for my question. What technology would people choose to replace such a system, and how many people would it take to replace ? It has been recommended to me that C# and SQL server be used to replace this, and that it would take a couple of good people a year or two to re-develop from the ground up (we have all of the functional specifications from the last 20 years to work from). However, without having in depth knowledge of this technology stack I am quite concerned about the time period (I think it is very optimistic), I am concerned about the scaleability of the SQL server, and most importantly I am deeply concerned that we will loose this advantage that we have enjoyed that allows us to change the functionality of the current system via updates online without effecting logged on users. I am told that this sort of thing is just not possible in C# and if we have to provide an update to fix a bug, or provide new functionality then all users will have to replace the effected EXE and DLL files, i.e. all of them, 1000s of users would have to do this each and every time we update. This would be done automatically via a process called OneClick, but I am assuming if there is a company policy within our client environment that EXE changes are not allowed, then OneClick will not be viable. I am told if we took a browser approach to the new development then any updates would be server side (which is better), but, would still require an outage to apply updates.
Finally, more information on the online updates that are now possible. Currently all of the systems are replicated for disaster recovery and 100% uptime during update purposes. When we currently update our systems (at one central location) those logical updates are automatically applied at all replicated systems also without user intervention. Another concern that I have is that as well as the problem we face with updating multiple locations with the same update, which it seems is a requirement in C# or so I am being told, we will also have all of the replicated systems to update manually as well. As you can see our support team is small, so I am worried about a future blowout in maintenance resources required to maintain all of this, and then the cost in terms of times fixing mistakes that may creep in with all these additional tasks that may be required to perform the same exercise that we currently do only once.
Finally, a final peice of information on how we currently do updates. If the update is structural in nature, i.e. changes the physical structure of the database, then an outage is required, a full system down outage. When we apply the update the structural change is made, and this is automatically replicated across all secondary (standby enviroments). The users are not effected in terms of the software for the thin client or browsers. They simply log back on after the outage is complete. We currently have a window at a set time, once per month to perform these updates, however, it is rarely required. Once per week, we have a window for functional changes to be applied, and these are appled on line whilst the users are all on line performing their daily and periodic tasks.
So, if anyone out there can give me some insight into what technologies are available for such a system replacement or whether C# and SQL server can provide the necessary services and performance we actually need, i.e. I would be particularly interested to know whether in fact C# applications can be updated in real time, then that would be fantastic. We are obviously in the very early stages of this process in terms of how this should be done, so any information you can provide would be greatly appreciated and will save many hours of research.
Thank you in advance.
From the basic requirements you describe, my first thought is that you should probably adopt a full Web-based solution for your system, that way all updates can be done centrally without too much negative effect on your client access.
But if I understand correctly your question, one aspect you're requiring is to have executable code ready at the client-side (so a pure Web-solution won't work).
In that case, something that can quickly & easily update at the client side is needed.
We've been using the node.js and MongoDB stack for a few years now, there are some quite interesting effects of using pure scripts for your business logic: besides being easy to develop, the scripts themselves, when designed with certain guidelines, can perform "hot reload" on the fly to update your business logic. So this is what I'd recommend trying / looking at.
Efficiency of node.js and the flexibility provided by NoSQL DB such as MongoDB is well described in many places if you do a simple Google search.

How do you monitor your website?

My team is responsible for a high traffic website, which is very dynamic with about 3.5 million unique urls. We deploy our application about 1 per week, we have a CMS that updates about 100 updates per week and our internal data source releases about 1 a week as well, and we consume about 10 other public webservices. It is always our teams responsibly to make sure everything is up and running.
We use pingdom to make sure some of are up, but it is limited to a few checks and it does not handle as many urls as we need.
We use Nagios as well but it is a bit of a black box and has not been fully adopted by our development team. Most of our developers are windows focused and cringe at the thought of all the configuration.
Most of what we need is just monitoring a few urls, and something that can notify me when things go down or change.
I think you should something like unit testing, that does internal testing of every website release before and after deoyment. Your application should have excellent exception handling as well and an exception should be logged and monitored.
If you use external monitoring with a tool like pingdom or www.downnotifier.com you can check one url per pagetime. For example: one news article, one text page and one product page.

Architecture ideas for Rails 3 application

I was hoping to get some opinions regarding the architecture of our Rails 3 application.
Currently we have a Rails 3.0.7 application that allows users to manage content that is displayed on their TV (promotions, videos, menus, sports stats, etc.) through one of our connected media devices. We have over 1000 (and growing) of these connected devices that poll our system every minute to check for changes to their content, and every 15 minutes to report their stats (e.g. CPU, Memory, etc.).
One of the major advantages of our system is that we, as admins, can change how an individual content item looks/works and it is distributed to all devices that use it. The disadvantage of this feature is when we make a change our system becomes temporarily unusable because all the connected devices ask for their update about the same time.
Therefore, we plan on re-architecting our application so the content mgt. system is not impacted when the devices are communicating with the application. There are probably dozens of way to solve this issue. One way would be to have a separate Rails application that is only for the devices to get the content they should display, admins can monitor, etc.. It could share the model, database, etc. with the current content mgt. system. This way might prove difficult to manage the models, migrations, etc. I obviously don't want to duplicate the models. Also it would be ideal if the content mgt. system could still display status of devices for accounts so account admins can see if their devices are online, etc.
I'm thinking some type of queue mechanism is a good fit like resque/redis because when changes are made in the content mgt. system we could just queue a job which the device instance could pick up and process.
I wanted to toss this out to the community to get opinions and ideas from other folks that might have worked or are still working with systems that leverage connected devices. Thanks in advance for your contributions. I appreciate it!
Louis
1000+ clients with ~1 req. per minute each does not sound like a load that requires architectural changes for normal operation. Generally, simple one-app architecture will be easier to maintain in the long run, so you should try sticking to it until there're issues that cannot be solved.
If performance/responsiveness is the main issue, why not add a caching proxy server to the stack?
Other simple option is to install the app on two servers and use one for admin and other for client devices. Note that this will only help if database isn't the bottleneck.

Best way to run rails with long delays

I'm writing a Rails web service that interacts with various pieces of hardware scattered throughout the country.
When a call is made to the web service, the Rails app then attempts to contact the appropriate piece of hardware, get the needed information, and reply to the web client. The time between the client's call and the reply may be up to 10 seconds, depending upon lots of factors.
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results).
I basically see two options. Either run JRuby and use multithreading or else run several regular Ruby instances and hope that not many people try to use the service at a time. JRuby seems like the much better solution, but it still doesn't seem to be mainstream and have out of the box support at Heroku and EngineYard. The multiple instance solution seems like a total kludge.
1) Am I right about my two options? Is there a better one I'm missing?
2) Is there an easy deployment option for JRuby?
I do not want to split the web service call in two (ask for information, answer immediately with a pending reply, then force another api call to get the actual results).
From an engineering perspective, this seems like it would be the best alternative.
Why don't you want to do it?
There's a third option: If you host your Rails app with Passenger and enable global queueing, you can do this transparently. I have some actions that take several minutes, with no issues (caveat: some browsers may time out, but that may not be a concern for you).
If you're worried about browser timeout, or you cannot control the deployment environment, you may want to process it in the background:
User requests data
You enter request into a queue
Your web service returns a "ticket" identifier to check the progress
A background process processes the jobs in the queue
The user polls back, referencing the "ticket" id
As far as hosting in JRuby, I've deployed a couple of small internal applications using the glassfish gem, but I'm not sure how much I would trust it for customer-facing apps. Just make sure you run config.threadsafe! in production.rb. I've heard good things about Trinidad, too.
You can also run the web service call in a delayed background job so that it's not hogging up a web-server and can even be run on a separate physical box. This is also a much more scaleable approach. If you make the web call using AJAX then you can ping the server every second or two to see if your results are ready, that way your client is not held in limbo while the results are being calculated and the request does not time out.

Using MSTest as site/environment monitoring tool

We currently use Hp SiteScope for monitoring synthetic transactions across some of our web apps. This works pretty well except for the licensing cost for each synthetic transaction makes it prohibitive to ensure adequate coverage across our applications.
So, an alternative would be to use SiteScope's URL monitoring which can basically call a URL and then provide some basic checks for the certain strings. With that approach, I'd like to create a page that either calls a bunch of pages or try to tap into a MSTest group somehow to run tests.
In the end, I'd like a set of test cases that can be used against multiple environments to be used for production verification, uptime, status, etc.
Thanks,
Matt
Have you taken a look at System Center Operations Manager 2007?
I'm just getting started, but it appears to do what you are describing in your question.
We are looking to monitoring our data center and the a web application...from the few things I have found on the web it is going to fit our need.
Update
I've since moved to Application Insights. A great overview can be found here, https://azure.microsoft.com/en-us/documentation/articles/app-insights-monitor-web-app-availability/
There are two methods one can use, a simple ping, or record a multi-step synthetic user "experience". Basically you act as a user, and using IE and a Visual Studio Web Test project you record navigating around your site and upload that file to Azure.
For example, I record logging in, navigating a few pages, and then logging out. As long as all of those events happen in a timely manner the site is in a good operating state.
If the tests fail, take too long to respond for example, I'll get an email alerting me something isn't exactly right.

Resources