System replacement technology - scalability

I have some questions that I am hoping someone out there will be able to answer for me.
Our situation is that we are considering a ground up replacement for an existing system. Firstly I will describe the existing system that we have.
We are currently operating on a pure object stack: the environment is OO and the database is OO. We have 3-4 million lines of code, originally developed by 2-3 people, and a development team of 6 that continues to develop it. Initial development started in 1997, and we have many clients installed. The environment (language and database) is 64-bit, multilingual, and Unicode. The operating system we use is Windows (latest versions).
We have a number of modules which are delivered via a thin client (not a browser), and bandwidth usage is very low (it operates at a 64KB WAN performance level, which is still prevalent in some of the countries in which we operate, i.e. the infrastructure is poor). Our biggest implementation is for one of the biggest companies in the world. The target is to deliver the functionality for 30+ countries from one system instance (one physical DB) for that client, using the thin client, from one set of application servers. The application servers are located with the DB server and perform all of the processing; the thin client deals only with user interaction, display of the data and collection of the data. The system is used by 1000s of users on the thin client.
We also have mobile and portal components, developed in C#. They are a small segment of the overall system and connect using APIs. There are maybe 1000 mobile application users, with a final number expected to be 5000. Within the system there will be 500000-1000000 vendors, with each vendor expected to have at least two transactions every single day.
The DB itself is partitioned, and replicated to a number of locations in real time. The final size of the DB when implementation is complete is expected to be in the 2TB range, and the current system will deal with that, no problem. The way the replication works is that there are multiple replicated environments on hot standby, i.e. all application servers and API servers are replicated. Our largest client routinely (once per month) performs scheduled Windows updates, and when this occurs the primary environments are automatically rolled over to the secondaries, so the system remains available all of the time. In subsequent months, the system is rolled back to the primaries. This transition is very fast, i.e. real time.
At our largest client, the system was installed in 2014, and since that time it has not experienced any outage, except for planned outages for server maintenance or similar, i.e. it has not crashed or faulted in the first three years of operation. To provide updates and enhanced functionality to the target organisation, or specifically to one of its subsidiaries in the countries in which it operates, we are able to make changes to the system by loading functional updates online. This is a very important component of my question, as for many years we have been able to update at one central location and have the new functionality immediately available to all users in all countries whilst they are continuously using the application. This is without change to any .EXE or .DLL or other files that the end user is operating. This is a huge advantage for us currently, as many of the organisations we provide services to do NOT allow any change to EXE or DLL files on end user devices, and there is generally some approval process which takes some days and requires manual intervention by the users to make this process happen.
For further information, we have a support team of 6 providing support services to all of our clients in all of these countries. We operate three shifts of 2 people to provide these services. This should give you some background on the stability of the system and the level of support we provide. Our service level is described as outstanding. We do of course have SLAs in place, and we have never violated any SLA term.
So, now for my question. What technology would people choose to replace such a system, and how many people would it take to replace it? It has been recommended to me that C# and SQL Server be used, and that it would take a couple of good people a year or two to redevelop from the ground up (we have all of the functional specifications from the last 20 years to work from). However, without in-depth knowledge of this technology stack, I am quite concerned about the time period (I think it is very optimistic), I am concerned about the scalability of SQL Server, and most importantly I am deeply concerned that we will lose the advantage we have enjoyed of changing the functionality of the current system via online updates without affecting logged-on users. I am told that this sort of thing is just not possible in C#, and that if we have to provide an update to fix a bug or provide new functionality, then all users will have to replace the affected EXE and DLL files, i.e. 1000s of users would have to do this each and every time we update. This would be done automatically via a process called ClickOnce, but I am assuming that if there is a company policy within our client environment that EXE changes are not allowed, then ClickOnce will not be viable. I am told that if we took a browser approach to the new development then any updates would be server side (which is better), but would still require an outage to apply.
Finally, more information on the online updates that are now possible. Currently all of the systems are replicated, both for disaster recovery and for 100% uptime during updates. When we update our systems (at one central location), those logical updates are automatically applied at all replicated systems as well, without user intervention. Another concern I have is that, as well as having to apply the same update at multiple locations (which I am told is a requirement in C#), we will also have to update all of the replicated systems manually. As you can see, our support team is small, so I am worried about a future blowout in the maintenance resources required to maintain all of this, and about the cost in time of fixing mistakes that may creep in with all the additional tasks needed to perform the same exercise that we currently do only once.
Finally, a last piece of information on how we currently do updates. If the update is structural in nature, i.e. changes the physical structure of the database, then a full system-down outage is required. When we apply the update the structural change is made, and this is automatically replicated across all secondary (standby) environments. The users are not affected in terms of the software for the thin client or browsers. They simply log back on after the outage is complete. We currently have a window at a set time, once per month, to perform these updates; however, it is rarely required. Once per week, we have a window for functional changes to be applied, and these are applied online whilst the users are all online performing their daily and periodic tasks.
So, if anyone out there can give me some insight into what technologies are available for such a system replacement or whether C# and SQL server can provide the necessary services and performance we actually need, i.e. I would be particularly interested to know whether in fact C# applications can be updated in real time, then that would be fantastic. We are obviously in the very early stages of this process in terms of how this should be done, so any information you can provide would be greatly appreciated and will save many hours of research.
Thank you in advance.

From the basic requirements you describe, my first thought is that you should probably adopt a fully Web-based solution for your system. That way all updates can be done centrally without too much negative effect on your client access.
But if I understand your question correctly, one aspect you require is to have executable code ready at the client side (so a pure Web solution won't work).
In that case, something that can quickly and easily update at the client side is needed.
We've been using the Node.js and MongoDB stack for a few years now, and there are some quite interesting effects of using pure scripts for your business logic: besides being easy to develop, the scripts themselves, when designed with certain guidelines, can perform a "hot reload" on the fly to update your business logic. So this is what I'd recommend trying / looking at.
The efficiency of Node.js and the flexibility provided by a NoSQL DB such as MongoDB are well described in many places if you do a simple Google search.
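To make the "hot reload" idea concrete, here is a minimal sketch. It uses Python rather than Node.js purely for illustration, and the names HotRules, rules.py and fee are made up for the example. The business logic lives in a plain script file; the running process re-executes that file into a fresh namespace and swaps it in, so new logic takes effect without restarting the process or replacing any executable.

```python
import pathlib
import tempfile


class HotRules:
    """Holds business-logic functions loaded from a script file (hypothetical helper)."""

    def __init__(self, path):
        self.path = pathlib.Path(path)
        self.namespace = {}
        self.reload()

    def reload(self):
        # Execute the script into a fresh namespace, then swap it in with a
        # single assignment so callers never see a half-loaded script.
        source = self.path.read_text()
        fresh = {}
        exec(compile(source, str(self.path), "exec"), fresh)
        self.namespace = fresh

    def call(self, name, *args):
        # Look the function up at call time so the latest version is used.
        return self.namespace[name](*args)


# Demo: version 1 of the business rule lives in a script file.
script = pathlib.Path(tempfile.mkdtemp()) / "rules.py"
script.write_text("def fee(amount):\n    return amount + 1\n")
rules = HotRules(script)
before = rules.call("fee", 10)

# An "online update": the script is replaced and reloaded while the
# process keeps running.
script.write_text("def fee(amount):\n    return amount + 2\n")
rules.reload()
after = rules.call("fee", 10)
```

The "certain guidelines" part matters: this only works cleanly if callers always look functions up through the holder instead of keeping their own references to old functions, and if the scripts keep no state that a reload would throw away.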

Related

How to implement rolling updates and a relational database?

I have an existing system that uses a relational DBMS. I am unable to use a NoSQL database for various internal reasons.
The system is to get some microservices that will be deployed using Kubernetes and Docker with the intention to do rolling upgrades to reduce downtime. The back end data layer will use the existing relational DBMS. The micro services will follow good practice and "own" their data store on the DBMS. The one big issue with this seems to be how to deal with managing the structure of the database across this. I have done my research:
https://blog.philipphauer.de/databases-challenge-continuous-delivery/
http://www.grahambrooks.com/continuous%20delivery/continuous%20deployment/zero%20down-time/2013/08/29/zero-down-time-relational-databases.html
http://blog.dixo.net/2015/02/blue-turquoise-green-deployment/
https://spring.io/blog/2016/05/31/zero-downtime-deployment-with-a-database
https://www.rainforestqa.com/blog/2014-06-27-zero-downtime-database-migrations/
All of the discussions seem to stop around the point of adding/removing columns and data migration. There is no discussion of how to manage stored procedures, views, triggers etc.
The application is written in .NET Full and .NET Core with Entity Framework as the ORM.
Has anyone got any insights on how to do continuous delivery using a relational DBMS where it is a full production system? Is it back to the drawing board here, in as much as using a relational DBMS is "too hard" for rolling updates?
PS. Even though this is a continuous delivery problem I have also tagged with Kubernetes and Docker as that will be the underlying tech in use for the orchestration/container side of things.
All of the following under the assumption that I understand correctly what you mean by "rolling updates" and what its consequences are.
It has very little (as in : nothing at all) to do with "relational DBMS". Flatfiles holding XML will make you face the exact same problem. Your "rolling update" will inevitably cause (hopefully brief) periods of time during which your server-side components (e.g. the db) must interact with "version 0" as well as with "version -1" of (the client-side components of) your system.
Here "compatibility theory" (*) steps in. A "working system" is a system in which the set of offered services is a superset (perhaps a proper superset) of the set of required services. So backward compatibility is guaranteed if "services offered" is never ever reduced and "services required" is never extended. However, the latter is typically what always happens when the current "version 0" is moved to "-1" and a new "current version 0" is added to the mix. So the conclusion is that "rolling updates" are theoretically doable as long as the "services" offered on server side are only ever extended, and always in such a way as to be, and always remain, a superset of the services required on (any version currently in use on) the client side.
"Services" here is to be interpreted as something very abstract. It might refer to a guarantee to the effect that, say, if column X in this row of this table has value Y then I will find another row in that other table using a key computed such-and-so, and that other row might be guaranteed to have column values satisfying this-or-that condition.
If that "guarantee" is introduced as an expectation (i.e. requirement) on (new version of) client side, you must do something on server side to comply. If that "guarantee" is currently offered but a contradicting guarantee is introduced as an expectation on (new version of) client side, then your rolling update scenario has by definition become unachievable.
(*) http://davidbau.com/archives/2003/12/01/theory_of_compatibility_part_1.html
There are also parts 2 and 3.
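The superset rule above can be written down as a simple set check. This is only a sketch: the service names and version labels are invented for the example, and real "services" are rarely this easy to enumerate.

```python
def unsupported_versions(offered, required_by_version):
    """Return the client versions whose required services are not all offered."""
    return {
        version
        for version, required in required_by_version.items()
        if not required <= offered  # required must be a subset of offered
    }


# Hypothetical services offered by the server side (the db).
offered = {"get_vendor", "vendor.name", "vendor.display_name"}

# Hypothetical requirements of the client versions live during a rolling update.
required_by_version = {
    "v-1": {"get_vendor", "vendor.name"},
    "v0": {"get_vendor", "vendor.display_name"},
}

# Offered is a superset of everything required: the rollout is safe.
safe_now = unsupported_versions(offered, required_by_version)

# Reducing "services offered" while v-1 is still in use breaks v-1.
after_drop = unsupported_versions(offered - {"vendor.name"}, required_by_version)
```

Here safe_now is empty and after_drop contains only "v-1", which is the formal version of "never drop something an in-use client version still relies on".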
I work in an environment that achieves continuous delivery. We use MySQL.
We apply schema changes with minimal interruption by using pt-online-schema-change. One could also use gh-ost.
Adding a column can be done at any time if the application code can work with the extra column in place. For example, it's a good rule to avoid implicit columns like SELECT * or INSERT with no columns-list clause. Dropping a column can be done after the app code no longer references that column. Renaming a column is trickier to do without coordinating an app release, and in this case you may have to do two schema changes, one to add the new column and a later one to drop the old column after the app is known not to reference the old column.
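A rough illustration of the first half of the two-step rename follows. It uses Python's built-in sqlite3 as a stand-in for MySQL, so it says nothing about pt-online-schema-change itself; the table and column names are made up. The point is only that app code with explicit column lists keeps working while the new column is added and backfilled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vendor (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO vendor (id, name) VALUES (1, 'Acme')")

# Schema change 1: add the new column and backfill it from the old one.
conn.execute("ALTER TABLE vendor ADD COLUMN display_name TEXT")
conn.execute("UPDATE vendor SET display_name = name WHERE display_name IS NULL")

# Old app code is unaffected because it names its columns explicitly
# (no SELECT *, no INSERT without a column list).
conn.execute("INSERT INTO vendor (id, name) VALUES (2, 'Globex')")
old_rows = conn.execute("SELECT id, name FROM vendor ORDER BY id").fetchall()

# New app code reads the new column, falling back to the old one for rows
# written by old code that have not been backfilled yet.
new_rows = conn.execute(
    "SELECT id, COALESCE(display_name, name) FROM vendor ORDER BY id"
).fetchall()

# Schema change 2 (dropping "name") happens later, once no released
# app code references the old column any more.
```

Both queries return the same two rows, which is what lets old and new app versions run side by side between the two schema changes.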
We do upgrades and maintenance on database servers by using redundancy. Every database master has a replica, and the two instances are configured in master-master (circular) replication. So one is active and the other is passive. Applications are allowed to connect only to the active instance. The passive instance can be restarted, upgraded, etc.
We can switch the active instance in under 1 second by changing an internal CNAME, and updating the read_only option in each MySQL instance.
Database connections are terminated during this switch. Apps are required to detect a dropped connection and reconnect to the CNAME. This way the app is always connected to the active MySQL instance, freeing the passive instance for maintenance.
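The "detect a dropped connection and reconnect" requirement might look roughly like this on the app side. It is a sketch with a fake connection object; run_with_reconnect and FakeConnection are hypothetical names, and a real app would catch its driver's own disconnect error and connect to the CNAME.

```python
import time


def run_with_reconnect(connect, operation, attempts=3, delay=0.0):
    """Run operation(conn); on a dropped connection, reconnect and retry."""
    conn = connect()
    for attempt in range(1, attempts + 1):
        try:
            return operation(conn)
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(delay)
            # Connecting again resolves the name afresh, so the new
            # connection lands on whichever instance is now active.
            conn = connect()


class FakeConnection:
    """Stand-in for a database connection (assumption for the demo)."""

    def __init__(self, alive):
        self.alive = alive

    def query(self, sql):
        if not self.alive:
            raise ConnectionError("server has gone away")
        return "ok: " + sql


# The first connection was terminated by the switch; the second one works.
pool = [FakeConnection(alive=False), FakeConnection(alive=True)]
connect_calls = []


def connect():
    connect_calls.append(1)
    return pool[len(connect_calls) - 1]


result = run_with_reconnect(connect, lambda c: c.query("SELECT 1"))
```

Blindly retrying is only safe for statements that can be repeated; a transaction that was cut off half way has to be restarted from its beginning.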
MySQL replication is asynchronous, so an instance can be brought down and back up, and it can resume replicating changes and generally catches up quickly, as long as its master keeps the binary logs needed. If the replica is down for longer than the binary log expiration, then it loses its place and must be reinitialized from a backup of the active instance.
Re comments:
how is the data access code versioned? ie v1 of app talking to v2 of DB?
That's up to each app developer team. I believe most are doing continual releases, not versions.
How are SP's, UDF's, Triggers etc dealt with?
No app is using any of those.
Stored routines in MySQL are really more of a liability than a feature. No support for packages or libraries of routines, no compiler, no debugger, bad scalability, and the SP language is unfamiliar and poorly documented. I don't recommend using stored routines in MySQL, even though it's common in Oracle/Microsoft database development practices.
Triggers are not allowed in our environment, because pt-online-schema-change needs to create its own triggers.
MySQL UDFs are compiled C/C++ code that has to be installed on the database server as a shared library. I have never heard of any company that used UDFs in production with MySQL. There is too high a risk that a bug in your C code could crash the whole MySQL server process. In our environment, app developers are not allowed access to the database servers for SOX compliance reasons, so they wouldn't be able to install UDFs anyway.

Architecture ideas for Rails 3 application

I was hoping to get some opinions regarding the architecture of our Rails 3 application.
Currently we have a Rails 3.0.7 application that allows users to manage content that is displayed on their TV (promotions, videos, menus, sports stats, etc.) through one of our connected media devices. We have over 1000 (and growing) of these connected devices that poll our system every minute to check for changes to their content, and every 15 minutes to report their stats (e.g. CPU, Memory, etc.).
One of the major advantages of our system is that we, as admins, can change how an individual content item looks/works and it is distributed to all devices that use it. The disadvantage of this feature is when we make a change our system becomes temporarily unusable because all the connected devices ask for their update about the same time.
Therefore, we plan on re-architecting our application so the content mgt. system is not impacted when the devices are communicating with the application. There are probably dozens of ways to solve this issue. One way would be to have a separate Rails application that is only for the devices to get the content they should display, for admins to monitor them, etc. It could share the model, database, etc. with the current content mgt. system. With this approach it might prove difficult to manage the models, migrations, etc. I obviously don't want to duplicate the models. Also it would be ideal if the content mgt. system could still display status of devices for accounts so account admins can see if their devices are online, etc.
I'm thinking some type of queue mechanism is a good fit like resque/redis because when changes are made in the content mgt. system we could just queue a job which the device instance could pick up and process.
I wanted to toss this out to the community to get opinions and ideas from other folks that might have worked or are still working with systems that leverage connected devices. Thanks in advance for your contributions. I appreciate it!
Louis
1000+ clients with ~1 req. per minute each does not sound like a load that requires architectural changes for normal operation. Generally, simple one-app architecture will be easier to maintain in the long run, so you should try sticking to it until there're issues that cannot be solved.
If performance/responsiveness is the main issue, why not add a caching proxy server to the stack?
Another simple option is to install the app on two servers and use one for admin and the other for client devices. Note that this will only help if the database isn't the bottleneck.
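The caching idea can be sketched in a few lines. This is Python rather than Rails and not tied to any particular proxy; ContentCache and the render function are hypothetical. The expensive work is done once per content version, and every device asking for that version gets the cached result.

```python
class ContentCache:
    """Caches rendered content per (item, version) so many polls cost one render."""

    def __init__(self, render):
        self._render = render
        self._cache = {}

    def get(self, item_id, version):
        key = (item_id, version)
        if key not in self._cache:
            # Only the first device to ask for a new version pays for the render.
            self._cache[key] = self._render(item_id, version)
        return self._cache[key]


render_calls = []


def render(item_id, version):
    # Stand-in for the expensive part (database queries, templating).
    render_calls.append((item_id, version))
    return f"content {item_id} v{version}"


cache = ContentCache(render)

# 1000 devices all ask for the updated item at about the same time.
responses = [cache.get("menu", 2) for _device in range(1000)]
```

After the loop, render has been called exactly once, which is the effect a caching proxy in front of the app would give you for the device polls.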

Should I choose cloud?

I'm about to start development on a project with very uncertain load/traffic specifics. When it is released there will certainly be very low load that can easily be handled by a single desktop quad-core machine.
The problem is that there will be (after some invite-only period) a strong publicity for the product so I expect considerable traffic/load peaks.
I haven't read enough about cloud providers and I'm mostly leaning toward Amazon or Azure for the credibility these two companies have without checking them out as I should with others (ie. Rackspace that I suppose is also a cloud service provider).
What I want
I would like to create a normal Asp.net MVC web application that can be run on in-house single machine low-cost server. It would run web server along with database (relational and maybe also document) and fulltext search (not SQL FTS but rather high speed separate product like Lucene or Sphinx). But after initial invite-only period I'd like to move this app to the cloud to make it more traffic/load demand-friendly.
As far as I know, Amazon offers a sort of virtual machine hosting, which I understand you set up as a normal server but which has flexible resources in terms of load capacity. I'm not sure if that can be accomplished on Azure as well.
Questions
What is your experience with application transition to cloud and which one did you choose and why?
What would you recommend I should think about when designing/developing the solution to make the transition as painless as possible.
Based on your experience is it better to move to the cloud (financial wise) or is it better to buy your own servers and load balance application yourself and maybe save money on the long run?
"Cloud" is such a vague term. Still, I think this is a very good question.
Basically, IaaS cloud hosting does not magically make your application scale. It's really a virtual private server with very short contract / cancellation periods.
For scalability, the magic lies not so much in the hosting, but in the horizontal scalability of the application code itself. This is related to all the distributed computing challenges. For example, adding more application servers is not always easy: you must be sure that you don't persist any user state in the server application (but rather in a database, static can be evil), caching can be problematic because local caches can make the situation worse if you're using a round-robin strategy, etc.
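A toy illustration of the "don't persist user state in the server application" point, in Python. The dict stands in for a shared database or cache, and AppServer is a made-up class. Because the session lives in the shared store, it does not matter which server the round-robin picks for the next request.

```python
import itertools


class AppServer:
    """A stateless app server: all user state lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared by every server instance

    def handle(self, session_id, item):
        cart = self.store.setdefault(session_id, [])
        cart.append(item)
        return self.name, list(cart)


# The dict is a stand-in for a database or shared cache (assumption).
shared_store = {}
servers = itertools.cycle(
    [AppServer("a", shared_store), AppServer("b", shared_store)]
)

# Round-robin: consecutive requests from the same user hit different servers.
first = next(servers).handle("session-1", "apple")
second = next(servers).handle("session-1", "pear")
```

If each server kept the cart in its own memory (or in a static), the second request would land on server "b" and see an empty cart.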
What is your experience with application transition to cloud and which one did you choose and why?
What would you recommend I should think about when designing/developing the solution to make the transition as painless as possible.
You don't really have to do anything different just to host on EC2 or Azure -- basically. But of course, it's not that easy when things grow.
For instance, EC2 instance storage is rather limited. Additional storage on EBS, however, does not provide comparable performance characteristics and can be a bit more laggy than a disk. The point here is that EBS does magically scale, and it's probably more PaaS than IaaS; but it's not a simple hard disk and it does, consequently, not behave like a hard drive. I don't know about Azure block storage. In general, expect additional abstraction layers to introduce problems of their own, no matter what they do.
Based on your experience is it better to move to the cloud (financial wise) or is it better to buy your own servers and load balance application yourself and maybe save money on the long run?
Typical cloud providers are more expensive than the usual 'round-the-corner VPS providers, but they are, in my experience, also much more reliable and professional. EC2 has a free tier (but it's quite small), Azure gives you a small instance for free for 3 months.
Doing the calculation right is rather tricky; for example, if you have to shut down your service for whatever reason, it's nice to be able to cancel now rather than pay another year - you might want to put this risk into your calculation. On the other hand, both EC2 and Azure will be considerably cheaper if you sign up for 6 or 12 months, rather than paying by the hour.
You might want to check out the free Azure plan, because it's nice to start fiddling around without any cost. A big advantage of cloud providers is that you can scale vertically very easily: buying a 16 core, 64GB RAM server machine is really expensive, but if there's so much traffic on your site, upgrading your plan won't be such a big issue.
As no one has mentioned it yet...
AppHarbor has been amazing. You can push stuff in a matter of minutes. Deployment is a breeze. And setting up your project for it is easier as well. And it doesn't even require any major changes in your solution to fit in.
For the full-text search, you might consider something like Websolr.
A lot of this depends on what your app is doing (e.g., are there separable components that might benefit from running on different instances, vs. a simple CRUD application with a front end). One thing to consider is that in a cloud application you normally don't have a traditional relational database. As such, you have to choose either cloud or traditional hosting, or plan on coding your access layer twice. Azure does have relational databases (SQL Azure), although they're not identical to SQL Server 2008R2. You're going to have to research the pros/cons of a cloud setup for your specific situation.
As far as financial concerns, it's usually a lot cheaper to just get an account with a hosting company instead of a cloud service, since you pay by the month, instead of the hour (last time I checked an account with Azure running 24/7 for a month would cost about $40-$50, while you can get hosting for $15 a month). The savings with the cloud come in when you have to run several servers, and the cost of maintaining them surpasses the cost of the instance on the cloud platform.
So, sorry, there's no silver bullet answer for you. Read up on the different services available. Consider what your application needs, what prices will be, and go from there.
I have just migrated an MVC-based application from a dedicated server to Azure. When migrating the MSSQL database, I first tried importing .bacpac files, but some of the tables failed because of their size. I then used the SQL Database Migration Wizard, which worked fine for small tables but failed for tables with BLOB fields. For these tables I had to use temporary intermediate tables. Then, after all the data was transferred, setting up the web app was a breeze and we went into production. At first, everything seemed to work just fine, but after a couple of hours, when the load got heavier, all kinds of errors occurred. I went into the Azure portal and it was really easy to see the

Keeping applications and infrastructure connected

I work in an IT department that is divided into two groups. One group develops and manages applications, the other manages the company's infrastructure and servers. One of the problems we face is a break down in communication. I work for the application group and one of the problems I have is not being notified when a server is taken down by infrastructure, or a database is being refreshed.
Does anyone have suggestions on how to improve communications between the two groups or any ideas on how to keep a light-weight log across multiple systems (both linux and windows)? Ideally it would be nice if we could have our boxes just tweet their statuses or something.
Thanks for the help,
Ben
One thing you could do to communicate server status is to have your infrastructure group set up a network monitoring system like Nagios. This will give everyone in your application group the ability to get a snapshot view of the status of every server in the system. Having this kind of status is invaluable when you are doing development.
Nagios gives you network monitoring, but also allows you to show scheduled down time for a particular server in the system.
Another thing your group could do to foster communication with the Infrastructure is to have your build system report which servers it is currently using for building and testing your products.
Also, setting up regular meetings between stakeholders of both groups is probably a good idea. If you are all talking to each other, even for 15 minutes a week, you'll probably see incidents like the one you described above go down quite a bit.
I think this is a bigger issue of change control.
You should have hardware and software change control and an approval process.
Ultimately, infrastructure serves you - the purpose for IT infrastructure is to run applications.
In my current large financial data company, servers are not TOUCHED without proper authorization through the client and application groups. It seems like a huge pain, but every single server is there for a reason - to meet a specific business goal and run a specific application. There is simply no excuse for the infrastructure group to be changing things or upsetting servers of their own volition.
Response to critical hardware failure might be an exception.
Needed software and OS updates are handled through scheduled maintenance windows and an approved change process.
I like the Nagios idea as well. If you want to set up something that's more of a communication tool, I would recommend a content management system like Drupal.
We use Drupal internally to communicate between teams. When one team takes a server down, they would add an event into Drupal. The rest of us would either get it as an email, an RSS item or just by refreshing the page.
Implement a change control process where changes are submitted, approved and scheduled for BOTH groups. This lets everyone know what is going on. This process can be as light or heavy-weight as you want.
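As a rough picture of how light such a process can be, here is a sketch of a change request that can only move submitted -> approved -> scheduled. It is written in Python; the statuses, the class and the server name db01 are invented for the example.

```python
# Which status a change request may move to from each status (assumed workflow).
ALLOWED_TRANSITIONS = {
    "submitted": {"approved", "rejected"},
    "approved": {"scheduled"},
    "scheduled": {"completed"},
}


class ChangeRequest:
    """A single change to a server, visible to both groups."""

    def __init__(self, server, description, requested_by):
        self.server = server
        self.description = description
        self.requested_by = requested_by
        self.status = "submitted"
        self.history = ["submitted"]

    def move_to(self, new_status):
        if new_status not in ALLOWED_TRANSITIONS.get(self.status, set()):
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        self.status = new_status
        self.history.append(new_status)


change = ChangeRequest("db01", "refresh test database", "infrastructure")

# Scheduling without approval is refused.
try:
    change.move_to("scheduled")
    skipped_approval = True
except ValueError:
    skipped_approval = False

change.move_to("approved")
change.move_to("scheduled")
```

Each call to move_to is also the natural place to send the email or RSS item mentioned above, so the other group hears about the change before the server goes down.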

Tools to assist managing the application promotion process in an enterprise environment

I am curious on how others manage code promotion from DEV to TEST to PROD within an enterprise.
What tools or processes do you use to manage the "red tape", entry/exit criteria side of things?
My current organisation is half stuck between some custom online forms type functionality and paper based dependencies to submit documents, gather approvals and reviews.
All this is left in the project managers hands to track what has been submitted, passed review, approved and advise management if there are any roadblocks that may need approval to be "overlooked" before an application can be promoted to the next environment.
A browser-based application would be ideal... so what's out there? Please show me that your google-fu is better than mine.
It's hard to find one that's good via Google. There is a vast array of tools out there for issue management, so I'll mention what we use and what we would like to use.
We currently use Serena products. They have worked well for us in the past. TeamTrack is our issue management and handles the life cycle of any issue we work on. Version Manager is our source control and has the feature of implementing promotional groups like DEV, TEST and PROD. We use DEV, TSTAGE, TEST, PSTAGE and PROD to signify the movement from one to the other, but it's much the same. The two products integrate nicely so that the source associated with the issues is linked, but we have no build process set up in this environment. It's expensive, but it works well.
We are looking to move to a more common system using Jira for issue management, Subversion for source control, Fisheye to link the two together and Cruise Control for build management. This is less expensive, totaling a few thousand for an enterprise license, and provides all the same features, with the added bonus of SVN, which is a very nice code version manager.
I hope that helps.
There are a few different scenarios that I've experienced over the years:
Dev -> Test : There is usually a code freeze date that stops work on new features, and the test environment gets a build of the code that has been tagged/labelled/archived. This then gets copied onto the machines and the tests are run. This is also usually the least detailed of any push.
Test -> Prod : This requires the minor change that production has to go down, which can mean that a "gone fishing" page goes up or IIS doesn't have any sites running while the code is copied over again. There are special cases to this where a load balancer can act as a switch so that the promotion happens and none of the customers experience any down time, as the ones on the older server will move once their session ends.
To elaborate on that switch idea, the setup is to have 2 potentially live servers with only one taking requests: the load balancer sends all the traffic to one machine, and can be switched over once the other server has the updated code ready to go live.
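In code form the switch looks something like this. It is a Python toy model, not a real load balancer configuration; the server names "blue" and "green" and the version labels are assumptions for the example.

```python
class BlueGreenSwitch:
    """Toy load balancer: two servers, only the active one takes requests."""

    def __init__(self):
        self.versions = {"blue": "v1", "green": "v1"}
        self.active = "blue"

    @property
    def standby(self):
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version):
        # New code is only ever copied onto the server that takes no traffic.
        self.versions[self.standby] = version

    def switch(self):
        # The promotion itself is just flipping which server is active.
        self.active = self.standby

    def handle(self, request):
        return f"{self.versions[self.active]} handled {request}"


balancer = BlueGreenSwitch()
balancer.deploy("v2")
during_deploy = balancer.handle("order 1")  # still served by the old code

balancer.switch()
after_switch = balancer.handle("order 2")   # now served by the new code
```

Rolling back is the same flip in the other direction, as long as the old server has not been touched in the meantime. Letting existing sessions finish on the old server, as described above, needs session draining on top of this.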
There can also be a staging environment which is between test and production where the process is similar in terms of there is a set date when the promotion happens.
Where I used to work there would be merge days where a developer spent most of a day in Perforce merging code so that it could be promoted from one environment to another.
Now there are a couple of cases where this isn't used:
"Hotfixes" or "hot patches" would occur where I used to work. In this case the specific files were copied up into the staging and production environments on their own, since the code change had to get into production ASAP: either something broke in production, or some new thing that takes 2 minutes had to get done. In this case, the code change being pushed had to be reviewed and approved before going out.
Those are the different approaches I've seen used. Generally there are schedules, and timelines may have to be changed or additional resources brought in to make a hard date, for example if a conference is on a particular weekend and such and such has to be ready for it.
Of course in a few places there has been the, "Oh, was that broken? Let me see..." and a few minutes later, "No, see it isn't broken for me," where someone changed things without asking permission or anything where a company still has what they call "cowboy programming."
Another point is the scale of the release:
1) Tiny - This is the case where one web page goes up so that user X can do Y.
2) Small - A handful or so of files that isn't really complicated but isn't exactly trivial.
3) Medium - Where going from one environment to another requires changing a bunch of files and usually has scripts to move.
4) Big - Where there are scheduled promotions and various developers are asked for who is taking which shifts when the live push is done. I had this in a case where there was a data migration to do in addition to a release of some new e-commerce sites.
5) Mammoth - Where everything is brand new including how this would be used. I don't think I've ever seen one of this size but I'd imagine Microsoft or Google would have releases of this size.
Most releases fall somewhere in that spectrum, so how much planning and preparation is needed can vary quite a bit. And let's not forget that regulatory compliance can be its own pain in getting some things done.

Resources