I was hoping to get some opinions regarding the architecture of our Rails 3 application.
Currently we have a Rails 3.0.7 application that allows users to manage content that is displayed on their TV (promotions, videos, menus, sports stats, etc.) through one of our connected media devices. We have over 1000 (and growing) of these connected devices that poll our system every minute to check for changes to their content, and every 15 minutes to report their stats (e.g. CPU, Memory, etc.).
One of the major advantages of our system is that we, as admins, can change how an individual content item looks or works, and the change is distributed to every device that uses it. The disadvantage is that when we make a change, our system becomes temporarily unusable because all the connected devices ask for their update at about the same time.
Therefore, we plan on re-architecting our application so the content mgt. system is not impacted when the devices are communicating with the application. There are probably dozens of ways to solve this issue. One way would be to have a separate Rails application that exists only for the devices to get the content they should display, for admins to monitor them, etc. It could share the models, database, etc. with the current content mgt. system. With this approach, managing the models, migrations, etc. might prove difficult, and I obviously don't want to duplicate the models. It would also be ideal if the content mgt. system could still display the status of devices for accounts, so account admins can see if their devices are online, etc.
I'm thinking some type of queue mechanism, like Resque/Redis, is a good fit, because when changes are made in the content mgt. system we could just queue a job which the device instance could pick up and process.
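To make the queue idea concrete, here is a rough sketch in Python rather than Ruby, using the standard library's in-memory queue as a stand-in for a Redis-backed queue such as Resque. The function names and the job shape are made up for illustration; this is not how Resque itself is called.

```python
import queue

# In-memory stand-in for a Redis-backed job queue (assumption for illustration).
content_updates = queue.Queue()


def publish_content_change(content_id, device_ids):
    """Content mgt. side: enqueue one job per device affected by the change."""
    for device_id in device_ids:
        content_updates.put({"device_id": device_id, "content_id": content_id})


def drain_jobs(batch_size):
    """Device-facing side: process at most batch_size jobs per pass."""
    processed = []
    for _ in range(batch_size):
        try:
            job = content_updates.get_nowait()
        except queue.Empty:
            break  # nothing left to do in this pass
        processed.append(job)
        content_updates.task_done()
    return processed
```

The point of the batch size is that the device-facing side works through a burst at its own pace instead of every device hitting the content mgt. system at once.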
I wanted to toss this out to the community to get opinions and ideas from other folks that might have worked or are still working with systems that leverage connected devices. Thanks in advance for your contributions. I appreciate it!
Louis
1000+ clients with ~1 req. per minute each does not sound like a load that requires architectural changes for normal operation. Generally, a simple one-app architecture will be easier to maintain in the long run, so you should try sticking to it until you hit issues that cannot be solved within it.
If performance/responsiveness is the main issue, why not add a caching proxy server to the stack?
Another simple option is to install the app on two servers and use one for admins and the other for client devices. Note that this will only help if the database isn't the bottleneck.
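As a rough illustration of why the caching suggestion helps with the update burst: if the response for a content item is cached for a short time, 1000 devices polling for it cost one render instead of 1000. The sketch below is Python and only shows the idea; a real caching proxy sits in front of the app and is configured rather than coded. The class name, the key format, and the injectable clock are assumptions.

```python
import time


class TTLCache:
    """Tiny time-based cache: recompute a value only after it has expired."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self._store = {}            # key -> (expires_at, value)

    def fetch(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]           # still fresh: skip the expensive work
        value = compute()           # expired or missing: render once
        self._store[key] = (now + self.ttl, value)
        return value
```

With a TTL of one minute (the polling interval from the question), each content item would be rendered at most once per minute no matter how many devices ask for it.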
I have some questions that I am hoping someone out there will be able to answer for me.
Our situation is that we are considering a ground up replacement for an existing system. Firstly I will describe the existing system that we have.
We are currently operating on a pure object stack. The environment is OO and the database is OO. We currently have 3-4 million lines of code, which were developed by 2-3 people, and we currently have a development team of 6, which continues to develop the system. The initial development started in 1997, and we have many clients installed. The environment is 64-bit (language and database), multilingual, and Unicode. The operating system we use is Windows (latest versions). We have a number of modules which are delivered via a thin client (not a browser), and the bandwidth usage is very low (it operates at the 64KB WAN network performance level, which is still prevalent in some countries in which we operate, i.e. the infrastructure is poor). Our biggest implementation is for one of the biggest companies in the world, and the target is to deliver the functionality for 30+ countries from one system instance (one physical DB) for that client, using the thin client for all countries from one set of application servers. The application servers are located with the DB server and perform all of the processing; the thin client deals only with user interaction, display of the data, and collection of the data. The system is used by 1000s of users on the thin client. We also have mobile and portal components, which are developed in C#; they are a small segment of the overall system and connect using APIs. There are maybe 1000 mobile application users, with a final number expected to be 5000. Within the system there will be 500,000-1,000,000 vendors, with each vendor expected to have at least two transactions every single day.
The DB itself is partitioned, and replicated to a number of locations in real time. The final size of the DB when implementation is complete is expected to be in the 2TB range, and the current system will deal with that, no problem. The way the replication works is that there are multiple replicated environments on hot standby, i.e. all application servers and API servers are replicated. Our largest client routinely (once per month) performs scheduled Windows updates, and when this occurs the primary environments are automatically rolled over to the secondaries, so the system remains available all of the time. In subsequent months, the system is rolled back to the primaries. This transition is very fast, i.e. real time.
At our largest client, the system was installed in 2014, and since that time it has not experienced any outage, except for planned outages because of server maintenance or whatever in that time period, i.e. it has not crashed or faulted in the first three years of operation. To provide updates and enhanced functionality to the target organisation, or specifically to one of its subsidiaries in the countries in which it operates, we are able to make changes to the system by loading functional updates online. This is a very important component of my question, as for many years we have been able to update at one central location and have the new functionality immediately available to all users in all countries whilst they are continuously using the application. This is without change to any .EXE or .DLL or whatever files that the end user is operating. This is a huge advantage for us currently, as many of the organisations we provide services to do NOT allow any change to EXE or DLL files on end user devices, and there is generally some approval process which takes some days and requires manual intervention by the users to make this process happen.
For further information, we have a support team of 6 providing support services to all of our clients in all of these countries; we operate three shifts of 2 people to provide these services. So this should give you some background on the stability of the system and the level of support we provide. Our service level is described as outstanding. We do of course have SLAs in place, and we have never violated any SLA term.
So, now for my question. What technology would people choose to replace such a system, and how many people would it take to replace it? It has been recommended to me that C# and SQL Server be used to replace this, and that it would take a couple of good people a year or two to redevelop from the ground up (we have all of the functional specifications from the last 20 years to work from). However, without having in-depth knowledge of this technology stack, I am quite concerned about the time period (I think it is very optimistic), I am concerned about the scalability of SQL Server, and most importantly I am deeply concerned that we will lose the advantage we have enjoyed of being able to change the functionality of the current system via online updates without affecting logged-on users. I am told that this sort of thing is just not possible in C#, and that if we have to provide an update to fix a bug or provide new functionality, then all users will have to replace the affected EXE and DLL files, i.e. all of them, 1000s of users, would have to do this each and every time we update. This would be done automatically via a process called ClickOnce, but I am assuming that if there is a company policy within our client environment that EXE changes are not allowed, then ClickOnce will not be viable. I am told that if we took a browser approach to the new development then any updates would be server side (which is better), but would still require an outage to apply.
Next, more information on the online updates that are now possible. Currently all of the systems are replicated for disaster recovery and for 100% uptime during updates. When we update our systems (at one central location), those logical updates are automatically applied at all replicated systems as well, without user intervention. Another concern I have is that, as well as the problem of updating multiple locations with the same update (which it seems is a requirement in C#, or so I am being told), we will also have all of the replicated systems to update manually. As you can see, our support team is small, so I am worried about a future blowout in the maintenance resources required to maintain all of this, and then the cost in terms of time spent fixing mistakes that may creep in with all these additional tasks that may be required to perform the same exercise that we currently do only once.
One last piece of information on how we currently do updates. If the update is structural in nature, i.e. changes the physical structure of the database, then an outage is required, a full system-down outage. When we apply the update the structural change is made, and this is automatically replicated across all secondary (standby) environments. The users are not affected in terms of the software for the thin client or browsers. They simply log back on after the outage is complete. We currently have a window at a set time, once per month, to perform these updates; however, it is rarely required. Once per week, we have a window for functional changes to be applied, and these are applied online whilst the users are all online performing their daily and periodic tasks.
So, if anyone out there can give me some insight into what technologies are available for such a system replacement, or whether C# and SQL Server can provide the services and performance we actually need, that would be fantastic. I would be particularly interested to know whether C# applications can in fact be updated in real time. We are obviously in the very early stages of working out how this should be done, so any information you can provide would be greatly appreciated and will save many hours of research.
Thank you in advance.
From the basic requirements you describe, my first thought is that you should probably adopt a fully Web-based solution for your system; that way all updates can be done centrally without too much negative effect on your client access.
But if I understand your question correctly, one thing you require is to have executable code ready at the client side (so a pure Web solution won't work).
In that case, something that can quickly & easily update at the client side is needed.
We've been using the node.js and MongoDB stack for a few years now, and there are some quite interesting effects of using pure scripts for your business logic: besides being easy to develop, the scripts themselves, when designed with certain guidelines, can perform a "hot reload" on the fly to update your business logic. So this is what I'd recommend trying / looking at.
The efficiency of node.js and the flexibility provided by a NoSQL DB such as MongoDB are well described in many places if you do a simple Google search.
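The general shape of "hot reload" can be sketched in a few lines. This is a Python illustration of the idea, not node.js and not the code we run: business rules live in a registry, and a new version of a rule is compiled from source and swapped in while the process keeps serving callers. The `RuleRegistry` class and the convention that each script defines a function called `rule` are assumptions made for the example, and `exec` should only ever be fed source you trust.

```python
import threading


class RuleRegistry:
    """Holds business-logic functions that can be replaced while the process runs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rules = {}

    def load(self, name, source):
        """Compile trusted source defining a function `rule`, then swap it in."""
        namespace = {}
        exec(compile(source, f"<rule {name}>", "exec"), namespace)
        with self._lock:
            self._rules[name] = namespace["rule"]   # the swap is a single assignment

    def call(self, name, *args):
        with self._lock:
            rule = self._rules[name]                # grab the current version
        return rule(*args)                          # run it outside the lock
```

Calls that already started keep running the old version to completion; calls made after `load` returns see the new one, so logged-on users are not interrupted.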
I've been running a Rails app on 1 big dedicated server. Now, for scaling, I want to switch to a cloud hosting service and serve the app on 3 instances - App, DB and Redis.
I have had really bad experiences with Heroku in terms of performance, and hence cost efficiency. So for me 2 alternatives remain: Engine Yard and Enterprise-Rails.
What I find important is that Engine Yard doesn't offer an autoscaling option to handle peaks. On the other hand, Enterprise-Rails doesn't have much documentation; most things are handled by a support crew who set everything up.
What are the other differences, and which should I use for my website? I don't need much administration work and I am not experienced with it. Basically I just want my site to run safely, stably and cost-efficiently without much personal work involved.
I am running a massive Rails app off AWS at this time and I'm really happy with it. Previously I had a number of dedicated boxes that were always causing problems - sooner or later one of them would crash for some reason: RAID failures, database problems, whatnot.
At AWS I use RDS for the database and ElastiCache for caching. I keep all my code on a fat instance that acts as a staging server, and get a variable number of reserved instances to load the code via NFS.
I also use autoscaling - we've prepaid for a number of reserved instances, and autoscaling starts up nodes when CPU usage goes above 60%, then removes them when it goes below 25%. Autoscaling rules are based on CloudWatch alerts that can be set to monitor a particular group of instances, memcache servers, and so on. You even get e-mail and SMS notifications via SNS when certain scaling activities take place, say when more than 100 instances are spawned in less than 1 hour (massive traffic spike). The instances also get added straight to the load balancers, by the way, and you don't need to mess with the session store, as you can use the sticky session feature, which is quite nice.
Recently I also started using a 2nd launch group with spot instances. This complicated things a bit in terms of CloudWatch rules, but I'm able to save a lot every month as spot prices are much lower. When the spot price (minimum) I bid is not enough, the set-up I have switches back to reserved instances.
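The scaling rule itself is simple enough to write down. The Python sketch below only illustrates the decision logic with the 60% / 25% thresholds mentioned above; on AWS this lives in CloudWatch and autoscaling configuration, not in application code. The function name, the one-instance step, and the minimum and maximum bounds are assumptions.

```python
def desired_instance_count(current, avg_cpu_percent, minimum=1, maximum=20,
                           scale_up_above=60.0, scale_down_below=25.0):
    """Return how many instances the group should have after one evaluation."""
    if avg_cpu_percent > scale_up_above:
        target = current + 1        # busy: add a node
    elif avg_cpu_percent < scale_down_below:
        target = current - 1        # idle: remove a node
    else:
        target = current            # inside the band: leave the group alone
    return max(minimum, min(maximum, target))
```

The gap between the two thresholds matters: with a single threshold the group would flap, adding and removing nodes every time CPU crossed the line.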
Even more recently I've also started using CloudFront which got my app's page assets to load real fast (about 2 megs of CSS, JS, some icon sprites). Previously I was serving directly from instances via the load balancers.
This took about 20 hours to deploy, test and tune for maximum performance and availability.
One of the problems I have with AWS is that there's no support unless you're prepared to foot a bill. They claim some support is offered without a subscription but the only option in the support area is Billing. Ha. Fortunately it's all stable enough not to put me in a position where I'd have to pay for it.
Overall Rails fits in quite nicely with AWS. I spend less than 2 hours per month doing maintenance, where I was spending over 30 previously. Most important for me is that I know that I can GTFO on a vacation for X months knowing nothing will cause any trouble - I haven't had a monitoring alert in more than a year.
Later edit: the app is a sports site with white labeling feature, lots of users, lots of administrators working on content in the back-end, database intensive as we show market pricing data that should update every few seconds. I had an average load time of about 3 seconds per page with dedicated servers that were doing about the same thing - database, memcache, storage, load balancing, web app. Now my average is under 1 second. Monthly bill is about 8 times lower now.
While Engine Yard doesn't offer auto-scaling (it is in the pipeline), we do have a fairly easy to use scaling feature that allows you to spin up multiple instances at once in times of need.
The advantages over something like Enterprise-Rails are the full documentation, the choice to deploy from the CLI or the dashboard, and our amazing support team. It's also easier to use Engine Yard and move from a personal machine or from another cloud setup than it is to use a service such as AWS directly.
My organization is building a new version of our ticketing site and is looking for the best way to build an online waiting room when the number of users in our purchase path exceeds a certain limit. The best version of this queue would let new users in after existing users have either completed their purchase or have exceeded a timeout limit after entering the path.
I'm trying to get an idea of how this has been implemented by other organizations. Has anyone out there done something similar or have any experience with this? We have some ideas, but I'd like to get a sense of what solutions have been tried and what problems those solutions have run up against.
Just to be complete, this site is being built in Ruby on Rails, though I'd love to hear about how people have solved this regardless of platform.
Edit: To clarify: The need for the queue is not primarily to reduce load, but to limit the rate at which tickets are purchased on the web relative to people buying in other ways, like over the phone.
Before I outline one method for this, I want to point out that what you want to do doesn't make a lot of sense. Services on the web aren't like a physical store, where I can walk up and see that it's crowded and decide to stay or not. Queueing people on your site strikes me as shifting the blame from you (unable or unwilling to adequately provision resources) to me (punishing me for trying to use your site).
If you're selling something like show tickets, where quantity is limited and each item is tied to a seat, I think it's better to reserve items and time out those reservations if they aren't paid for in a timely manner. Ticketmaster does this, and I think it's a much better solution than blocking people at the door.
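A minimal sketch of the reserve-and-time-out approach, in Python for brevity. It is not how Ticketmaster implements it (I don't know their internals); the `SeatHolds` class, the hold length, and the injectable clock are all assumptions for illustration.

```python
import time


class SeatHolds:
    """Seats are held for a limited time; unpaid holds lapse and free the seat."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock
        self._holds = {}    # seat -> (customer, expires_at)
        self._sold = set()

    def reserve(self, seat, customer):
        now = self.clock()
        if seat in self._sold:
            return False
        hold = self._holds.get(seat)
        if hold is not None and hold[1] > now and hold[0] != customer:
            return False    # someone else holds it and the hold is still live
        self._holds[seat] = (customer, now + self.hold_seconds)
        return True

    def purchase(self, seat, customer):
        hold = self._holds.get(seat)
        if hold is None or hold[0] != customer or hold[1] <= self.clock():
            return False    # no live hold for this customer: must reserve again
        del self._holds[seat]
        self._sold.add(seat)
        return True
```

Note that expired holds are never swept; they are simply ignored the next time someone asks for the seat, so no background job is needed for this part.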
If you still want to go down this path, then I'd design the system like this:
As customers come to your site, record their arrival time. As they interact with the site, record a "last seen" time. "Last seen" will be used to determine activeness. You'll need a background job running very frequently to expire sessions quickly.
Once your limit is hit, you have an ordered queue of people who are blocked. As customers complete their transaction or time out, you'll mark the next person in the queue for entry into the purchase path.
For queued users, their browsers will make a request on a regular basis, checking to see if you've let them in yet. If yes, they proceed to the purchase path. If no, they continue to wait.
The purchase path needs a mechanism to check whether someone is trying to circumvent your waiting area, and to send them back if so.
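The steps above can be sketched as a small in-memory model. This is Python rather than Rails and keeps everything in one process; a real site would need shared storage (a database or Redis, say) behind it. The `WaitingRoom` class, its method names, and the injectable clock are assumptions for illustration.

```python
import time
from collections import deque


class WaitingRoom:
    def __init__(self, capacity, idle_timeout, clock=time.monotonic):
        self.capacity = capacity          # max users in the purchase path
        self.idle_timeout = idle_timeout  # seconds without activity before expiry
        self.clock = clock
        self.active = {}                  # session_id -> last seen time
        self.waiting = deque()            # session ids in arrival order

    def arrive(self, session_id):
        """First visit or a poll from a queued browser. True means 'you are in'."""
        self.expire_idle()
        if session_id in self.active:
            self.active[session_id] = self.clock()
            return True
        if session_id not in self.waiting:
            self.waiting.append(session_id)
        self._admit_waiting()
        return session_id in self.active

    def touch(self, session_id):
        """Purchase-path check: False means send the user back to the waiting area."""
        self.expire_idle()
        if session_id not in self.active:
            return False
        self.active[session_id] = self.clock()
        return True

    def complete(self, session_id):
        """Purchase finished: free the slot and let the next person in."""
        self.active.pop(session_id, None)
        self._admit_waiting()

    def expire_idle(self):
        now = self.clock()
        for sid, last_seen in list(self.active.items()):
            if now - last_seen > self.idle_timeout:
                del self.active[sid]
        self._admit_waiting()

    def _admit_waiting(self):
        while self.waiting and len(self.active) < self.capacity:
            self.active[self.waiting.popleft()] = self.clock()
```

Here expiry runs lazily on every request instead of in a background job, which is enough while there is steady traffic; a periodic job calling `expire_idle` would cover quiet periods.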
You might find the Online queuing for ticketing guide helpful. Check their repository at GitHub.
They have integrations for Ruby on Rails, PHP, .NET, iOS, Android and similar platforms.
Queue-it enables you to gain control of website overload during extreme traffic peaks by offloading end users into an online queue.
When a peak traffic event occurs on a website, the online queue system sends users to the virtual waiting room environment where the users wait and are redirected back to the website at a rate it can handle.
I work in an IT department that is divided into two groups. One group develops and manages applications; the other manages the company's infrastructure and servers. One of the problems we face is a breakdown in communication. I work for the application group, and one of the problems I have is not being notified when a server is taken down by infrastructure or a database is being refreshed.
Does anyone have suggestions on how to improve communications between the two groups or any ideas on how to keep a light-weight log across multiple systems (both linux and windows)? Ideally it would be nice if we could have our boxes just tweet their statuses or something.
Thanks for the help,
Ben
One thing you could do to communicate server status is to have your infrastructure group set up a network monitoring system like Nagios. This will give everyone in your application group the ability to get a snapshot view of the status of every server in the system. Having this kind of status is invaluable when you are doing development.
Nagios gives you network monitoring, but also allows you to show scheduled downtime for a particular server in the system.
Another thing your group could do to foster communication with the infrastructure group is to have your build system report which servers it is currently using for building and testing your products.
Also, setting up regular meetings between stakeholders of both groups is probably a good idea. If you are all talking to each other, even for 15 minutes a week, you'll probably see incidents like the one you described above go down quite a bit.
I think this is a bigger issue of change control.
You should have hardware and software change control and an approval process.
Ultimately, infrastructure serves you - the purpose for IT infrastructure is to run applications.
In my current large financial data company, servers are not TOUCHED without proper authorization through the client and application groups. It seems like a huge pain, but every single server is there for a reason - to meet a specific business goal and run a specific application. There is simply no excuse for the infrastructure group to be changing things or upsetting servers of their own volition.
Response to critical hardware failure might be an exception.
Needed software and OS updates are handled through scheduled maintenance windows and an approved change process.
I like the Nagios idea as well. If you want to set up something that's more of a communication tool, I would recommend a content management system like Drupal.
We use Drupal internally to communicate between teams. When one team takes a server down, they would add an event into Drupal. The rest of us would either get it as an email, an RSS item or just by refreshing the page.
Implement a change control process where changes are submitted, approved and scheduled for BOTH groups. This lets everyone know what is going on. This process can be as light or heavy-weight as you want.
I am curious about how others manage code promotion from DEV to TEST to PROD within an enterprise.
What tools or processes do you use to manage the "red tape", entry/exit criteria side of things?
My current organisation is half stuck between some custom online-forms functionality and paper-based processes to submit documents and gather approvals and reviews.
All this is left in the project manager's hands: tracking what has been submitted, passed review and been approved, and advising management if there are any roadblocks that may need approval to be "overlooked" before an application can be promoted to the next environment.
A browser-based application would be ideal... so what's out there? Please show me that your google-fu is better than mine.
It's hard to find a good one via Google. There is a vast array of tools out there for issue management, so I'll mention what we use and what we would like to use.
We currently use Serena products. They have worked well for us in the past. Team Track is our issue management and handles the life cycle of any issue we work on. Version Manager is our source control and has the feature of implementing promotional groups like DEV, TEST and PROD. We use DEV, TSTAGE, TEST, PSTAGE and PROD to signify the movement from one to the other, but it's much the same. The two products integrate nicely so that the source associated with the issues is linked, but we have no build process set up in this environment. It's expensive, but it works well.
We are looking to move to a more common system using Jira for issue management, Subversion for source control, Fisheye to link the two together and Cruise Control for build management. This is less expensive, totaling a few thousand for an enterprise license, and provides all the same features, with the added bonus of SVN, which is a very nice code version manager.
I hope that helps.
There are a few different scenarios that I've experienced over the years:
Dev -> Test : There is usually a code freeze date that stops work on new features. The code is tagged/labelled/archived and built, the build is copied onto the test environment's machines, and the tests are run. This is also usually the least detailed of any push.
Test->Prod : This requires the minor change that production has to go down, which can mean that a "gone fishing" page goes up or IIS doesn't have any sites running while the code is copied over again. There are special cases where a load balancer can act as a switch, so that the promotion happens and none of the customers experience any downtime, as the ones on the older server will move once their session ends.
To elaborate on that switch idea: the setup has 2 potentially live servers, with the load balancer sending all the traffic to just one of them. Once the other server has the updated code and is ready to go live, the load balancer is switched over to it.
There can also be a staging environment between test and production, where the process is similar in that there is a set date when the promotion happens.
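As a toy model of that switch (Python, purely illustrative): two servers, one taking all the traffic, and a swap that is only allowed once the idle one has code deployed. The `BlueGreenSwitch` class and the server names are made up; a real load balancer would also drain existing sessions before the old server goes quiet.

```python
class BlueGreenSwitch:
    """Two servers; the load balancer routes all traffic to whichever is live."""

    def __init__(self, live, standby):
        self.servers = {live: None, standby: None}  # server name -> deployed version
        self.live = live
        self.standby = standby

    def deploy_to_standby(self, version):
        # New code only ever goes onto the server that is not taking traffic.
        self.servers[self.standby] = version

    def switch(self):
        if self.servers[self.standby] is None:
            raise RuntimeError("standby has no deployed version")
        self.live, self.standby = self.standby, self.live

    def route(self):
        """Where the load balancer sends the next request."""
        return self.live
```

Rolling back is the same operation: the old version is still sitting on the server that just became standby, so switching again restores it.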
Where I used to work there would be merge days where a developer spent most of a day in Perforce merging code so that it could be promoted from one environment to another.
Now there are a couple of cases where this isn't used:
"Hotfixes" or "hot patches" would occur where I used to work. In this case the specific files were copied up into the staging and production environments on their own, since the code change had to get into production ASAP, either because something broke in production or because some new thing that takes 2 minutes had to get done. The code change getting pushed in still had to be reviewed and approved before going out.
Those are the different approaches I've seen used. Generally there are schedules, and timelines potentially have to be changed or additional resources brought in to make a hard date, e.g. if a conference is on a particular weekend and such and such has to be ready for it.
Of course in a few places there has been the, "Oh, was that broken? Let me see..." and a few minutes later, "No, see it isn't broken for me," where someone changed things without asking permission or anything, at a company that still has what they call "cowboy programming."
Another point is the scale of the release:
1) Tiny - This is the case where one web page goes up so that user X can do Y.
2) Small - A handful or so of files, where the change isn't really complicated but isn't exactly trivial.
3) Medium - Where going from one environment to another requires changing a bunch of files and usually has scripts to move.
4) Big - Where there are scheduled promotions and various developers are asked for who is taking which shifts when the live push is done. I had this in a case where there was a data migration to do in addition to a release of some new e-commerce sites.
5) Mammoth - Where everything is brand new including how this would be used. I don't think I've ever seen one of this size but I'd imagine Microsoft or Google would have releases of this size.
Most releases fall somewhere in that spectrum, so how much planning and preparation is needed can vary quite a bit. And let's not forget that regulatory compliance can be its own pain in getting some things done.