Multiple Projects, Multiple languages, Same authentication - ruby-on-rails

So I have multiple projects that should share a common core set of tables in a Postgres database for authentication. Things like 'user', 'account', 'group', and other related user information are stored in these tables. The projects I have currently are a nodejs app (multiple devices) and a Ruby web app (planning on multiple devices later on), and we could have a Django or another node project in the future as well. Is there an efficient, cost-effective way to do this that would be scalable and reliable? I was thinking about using an s3 instance with a Postgres database hosted on it and pointing all my authentication from my multiple apps to that database, but I wanted to see if others had thought about this problem as well.

First, please don't attempt to host Postgres in S3. I believe you may have meant EC2 with an EBS volume (EBS snapshots are stored in S3, but the volume itself is block storage attached to the instance). From an ease-of-use standpoint (particularly when considering ongoing maintenance), hosting any Postgres instance on Amazon's RDS product is truly a pleasure. Without going into all the details of that product, I'll simply state that you can set up high availability (failover), backups, upgrades, and monitoring with just a few clicks.
That being said, RDS is not the cheapest solution, but the cost is not exorbitant either, depending on your load and number of simultaneous connections.
If all this database is going to do is authenticate people and then disconnect, RDS may well be overkill and a waste of resources. However, if you're housing a fairly complex set of permissions and other user data, it'll likely be a fairly straightforward solution.
Depending on your budget and requirements, you may benefit from running pgPool on your app server to pool the connections, but I wouldn't start out using pgPool unless you need it.

Related

share sessions among different types of web servers

Some web services in my company are built with different web frameworks (Rails, Django, PHP).
What is the best practice for sharing session state, so that users won't have to log in again and again across different servers?
I run my Rails apps in an AWS Auto Scaling group.
That is, even when I browse the same website, my next request may land on another server, and I have to log in again because that server doesn't have my session state.
I need some better ideas or keywords to search for this kind of issue.
Thanks in advance
I can think of two ways in which you can achieve this objective:
1. Implement a custom session handling mechanism that makes use of database session management, i.e. all sessions will be stored in a special table in the database and will be accessible to all the servers.
2. Have a Central Authentication Service (CAS) which will act as a proxy to all the other servers. This means the authentication step has to happen before the requests reach the load balancer.
If you look around, option 1 might be recommended by many, but it may also be overkill since you'll need custom session management in each of the servers. However, your choice would probably depend on the specific objectives you want to achieve, the overall flexibility of the system architecture and the amount of time you have on your hands. The CAS might be a more straightforward way of solving the problem.
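To make the database-backed option more concrete, here is a minimal sketch of a shared session table. It is only an illustration under assumptions: `sqlite3` in memory stands in for the shared database, and the class name `DbSessionStore`, the `sessions` table and its columns are hypothetical, not taken from any framework.

```python
import secrets
import sqlite3
import time


class DbSessionStore:
    """Sessions kept in one table that every app server can reach."""

    def __init__(self, conn, ttl_seconds=1800):
        self.conn = conn
        self.ttl = ttl_seconds
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions ("
            "token TEXT PRIMARY KEY, "
            "user_id INTEGER NOT NULL, "
            "expires_at REAL NOT NULL)"
        )

    def create(self, user_id, now=None):
        """Store a new session and return its random token."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self.conn.execute(
            "INSERT INTO sessions VALUES (?, ?, ?)",
            (token, user_id, now + self.ttl),
        )
        self.conn.commit()
        return token

    def lookup(self, token, now=None):
        """Return the user id for a live session, or None."""
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT user_id FROM sessions WHERE token = ? AND expires_at > ?",
            (token, now),
        ).fetchone()
        return row[0] if row else None

    def purge_expired(self, now=None):
        """Delete expired rows; this has to be run periodically."""
        now = time.time() if now is None else now
        cur = self.conn.execute(
            "DELETE FROM sessions WHERE expires_at <= ?", (now,)
        )
        self.conn.commit()
        return cur.rowcount
```

Each web app (Rails, Django, PHP) would need its own equivalent of `lookup`, reading the token from a cookie shared across the servers.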
Storing user sessions in your application's database isn't a recommended option on AWS.
The biggest problem with using a database is that you need to write a clean-up script that runs every so often to clear the table of all the expired user sessions. This is messy, creates more overhead, and puts more pressure on your DB.
However, if you do want to use an actual database for this, you should use a NoSQL database like DynamoDB. This will give you much better performance than a relational database. It's probably more cost effective too in terms of data transfer. However, the biggest problem with this is that you still need that annoying clean-up script. Note: there is built-in support in the SDK for using PHP with DynamoDB for storing the user's session:
http://docs.aws.amazon.com/aws-sdk-php/v2/guide/feature-dynamodb-session-handler.html
The best but most costly solution is to use an ElastiCache cluster. You can set a TTL of your choice, which means you won't have to worry about clean-up scripts. ElastiCache will also get you much better performance than DynamoDB or any relational DB, as the data is stored in the RAM of the ElastiCache nodes. The main drawback of ElastiCache is that currently it can't scale dynamically. So if too many users log in at once and you don't have a big enough cluster already provisioned, things could get ugly.
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/WhatIs.html
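The TTL behaviour described above can be illustrated with a small in-process stand-in. This is a sketch of the semantics only, not of ElastiCache itself: a real deployment would talk to the cluster through a Redis or Memcached client, and the class name `TtlSessionCache` and its methods are hypothetical.

```python
import time


class TtlSessionCache:
    """In-process stand-in for a cache that expires keys after a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._data = {}     # token -> (expires_at, session)

    def set(self, token, session):
        self._data[token] = (self.clock() + self.ttl, session)

    def get(self, token):
        entry = self._data.get(token)
        if entry is None:
            return None
        expires_at, session = entry
        if self.clock() >= expires_at:
            # Expired entries disappear on access; no clean-up job needed.
            del self._data[token]
            return None
        return session
```

The point of the sketch is that expiry is a property of each stored key, so no separate script has to scan for stale sessions.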
But you can bet that all the biggest, baddest and best applications being hosted on AWS are using either DynamoDB, ElastiCache or a custom NoSQL or cache cluster solution for storing user sessions.
Note that both of these services are supported in all of the AWS SDKs:
https://aws.amazon.com/tools/

How can one rails app on heroku access many databases

I want to be able to have one app access multiple databases on the HEROKU "system".
Can the connection to the database be changed dynamically?
Why I ask...
I have an app that has a lot of very processor-heavy background jobs. If a given user uploads a product feed of, say, 50,000 products that have to be compared to existing products so that only the deltas are updated, it can take a "few" minutes.
Now to mitigate the delay I spin up multiple workers, each taking small bites out of the lot until there's none. I can get to about 20 workers before the GUI starts to feel sluggish because the DB is being hammered.
I've tuned some of the code and indexed the DB to some extent, and I'm sure there's more I could do, but it will eventually suffer from the law of diminishing returns.
For one user, I don't much care. If you upload 50k products you need to wait a bit.
But user one's choice to upload impacts user two (a different company, so no crossover of data).
Currently I handle different users by separating their data with schemas in postgresql.
The different users, however, share the same DB connection, and even on the best plan I can see a time when 20 users try to upload 50,000 products at the same time (first of the month/quarter, for example).
User 21 would see a huge slowdown on their system because of this.
So the question: Can I assign different users to different databases? User logs in, validates their info against a central DB, and then a different DB takes over?
My current solution is different instances of heroku. It's easy to maintain the code because it's one code base and I just script the git push(es). The only issue is the different login URLs, which I suppose I could confront if I can't find an easy DB switch solution.
It sounds like you're able to shard your data by user, or set of users without much concern since you already separate them by schema. If that's the case, and you're using Ruby and ActiveRecord, look at https://github.com/tchandy/octopus. I imagine you're not looking to spin up databases on the fly, rather, you'll have them already built and ready to be used, and can add more as you go.
Granted, it sounds like what you're doing could be done a lot more effectively by using the right tool for that type of intensive processing like one of the Heroku Hadoop add-ons; nonetheless, if that's not an option for whatever reason, check out the gem above. There are a couple other gems like it, and of course you could technically manage your own ActiveRecord connections without this gem, but I think you'll find that will be painful really fast.
Of course, if you aren't using Ruby or ActiveRecord, still shard the data, and look for something like the gem above in your app's language :).
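Independent of any particular gem, the core of sharding by user is a function from tenant to database. The sketch below is one possible way to do that, under assumptions: the shard URLs, the `PINNED` overrides and the function name `shard_for` are all hypothetical, and the databases are expected to exist already rather than being created on the fly.

```python
import hashlib

# Hypothetical, pre-provisioned shard databases.
SHARDS = [
    "postgres://shard0.example.invalid/app",
    "postgres://shard1.example.invalid/app",
    "postgres://shard2.example.invalid/app",
]

# Hypothetical explicit overrides, e.g. for tenants with very heavy uploads.
PINNED = {
    "big-uploader-inc": "postgres://dedicated.example.invalid/app",
}


def shard_for(tenant):
    """Return the database URL that holds this tenant's data."""
    if tenant in PINNED:
        return PINNED[tenant]
    # A stable hash keeps the mapping identical across processes and restarts
    # (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha256(tenant.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]
```

Note that plain modulo hashing moves tenants around whenever the number of shards changes, so a lookup table stored in the central database is often the safer choice once more shards are added.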
The Postgres databases on Heroku are configured with environment variables. When you run heroku config you should see:
DATABASE_URL: postgres://xxx.compute.amazonaws.com:5432/xxx
You can use these variables to connect to databases on other Heroku instances or to share a single database between different Heroku apps.
If you try to run this kind of thing on free Heroku instances, I think it is against their terms of service.
If it's about scalability, I think you will just have to pay for a more expensive database instance.
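As a rough illustration of using such a variable from application code, the sketch below splits a `postgres://` URL into connection settings using only the standard library. The function name `database_settings` and the second variable name `OTHER_DATABASE_URL` are hypothetical; the resulting dictionary would be handed to whatever database driver the app uses.

```python
import os
from urllib.parse import urlparse


def database_settings(env_var="DATABASE_URL", environ=os.environ):
    """Parse a postgres:// URL from the environment into its parts."""
    url = urlparse(environ[env_var])
    return {
        "host": url.hostname,
        "port": url.port or 5432,  # default Postgres port when none is given
        "user": url.username,
        "password": url.password,
        "dbname": url.path.lstrip("/"),
    }
```

An app that needs a second database could define another variable alongside `DATABASE_URL` and call `database_settings("OTHER_DATABASE_URL")`.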

Should I choose cloud?

I'm about to start development on a project with very uncertain load/traffic specifics. When it is released there will certainly be very low load that can easily be handled by a single desktop quad-core machine.
The problem is that there will be (after some invite-only period) strong publicity for the product, so I expect considerable traffic/load peaks.
I haven't read enough about cloud providers and I'm mostly leaning toward Amazon or Azure for the credibility these two companies have without checking them out as I should with others (ie. Rackspace that I suppose is also a cloud service provider).
What I want
I would like to create a normal Asp.net MVC web application that can be run on in-house single machine low-cost server. It would run web server along with database (relational and maybe also document) and fulltext search (not SQL FTS but rather high speed separate product like Lucene or Sphinx). But after initial invite-only period I'd like to move this app to the cloud to make it more traffic/load demand-friendly.
As far as I know, Amazon offers a sort of virtual machine hosting which, as I understand it, you set up as a normal server but which has flexible resources in terms of load capacity. I'm not sure if that can be accomplished on Azure as well.
Questions
What is your experience with application transition to cloud and which one did you choose and why?
What would you recommend I should think about when designing/developing the solution to make the transition as painless as possible.
Based on your experience is it better to move to the cloud (financial wise) or is it better to buy your own servers and load balance application yourself and maybe save money on the long run?
"Cloud" is such a vague term. Still, I think this is a very good question.
Basically, IaaS cloud hosting does not magically make your application scale. It's really a virtual private server with very short contract / cancellation periods.
For scalability, the magic lies not so much in the hosting, but in the horizontal scalability of the application code itself. This is related to all the distributed computing challenges. For example, adding more application servers is not always easy: you must be sure that you don't persist any user state in the server application (but rather in a database, static can be evil), caching can be problematic because local caches can make the situation worse if you're using a round-robin strategy, etc.
What is your experience with application transition to cloud and which one did you choose and why?
What would you recommend I should think about when designing/developing the solution to make the transition as painless as possible.
You don't really have to do anything different just to host on EC2 or Azure -- basically. But of course, it's not that easy when things grow.
For instance, EC2 instance storage is rather limited. Additional storage on EBS, however, does not provide comparable performance characteristics and can be a bit more laggy than a disk. The point here is that EBS does magically scale, and it's probably more PaaS than IaaS; but it's not a simple hard disk and it does, consequently, not behave like a hard drive. I don't know about Azure block storage. In general, expect additional abstraction layers to introduce problems of their own, no matter what they do.
Based on your experience is it better to move to the cloud (financial wise) or is it better to buy your own servers and load balance application yourself and maybe save money on the long run?
Typical cloud providers are more expensive than the usual 'round-the-corner VPS providers, but they are, in my experience, also much more reliable and professional. EC2 has a free tier (but it's quite small), and Azure gives you a small instance for free for 3 months.
Doing the calculation right is rather tricky; for example, if you have to shut down your service for whatever reason, it's nice to be able to cancel now rather than pay another year - you might want to put this risk into your calculation. On the other hand, both EC2 and Azure will be considerably cheaper if you sign up for 6 or 12 months, rather than paying by the hour.
You might want to check out the free Azure plan, because it's nice to start fiddling around without any cost. A big advantage of cloud providers is that you can scale vertically very easily: buying a 16 core, 64GB RAM server machine is really expensive, but if there's so much traffic on your site, upgrading your plan won't be such a big issue.
As no one has mentioned it yet...
AppHarbor has been amazing. You can push stuff in a matter of minutes. Deployment is a breeze, setting up your project for it is easy as well, and it doesn't require any major changes to your solution.
For the full-text search, you might consider something like Websolr.
A lot of this depends on what your app is doing (e.g., are there separable components that might benefit from running on different instances, vs. a simple CRUD application with a front end). One thing to consider is that in a cloud application you normally don't have a traditional relational database. As such, you have to choose either cloud or traditional hosting, or plan on coding your access layer twice. Azure does have relational databases (SQL Azure), although they're not identical to SQL Server 2008R2. You're going to have to research the pros/cons of a cloud setup for your specific situation.
As far as financial concerns, it's usually a lot cheaper to just get an account with a hosting company instead of a cloud service, since you pay by the month, instead of the hour (last time I checked an account with Azure running 24/7 for a month would cost about $40-$50, while you can get hosting for $15 a month). The savings with the cloud come in when you have to run several servers, and the cost of maintaining them surpasses the cost of the instance on the cloud platform.
So, sorry, there's no silver bullet answer for you. Read up on the different services available. Consider what your application needs, what prices will be, and go from there.
I have just migrated an MVC-based application from a dedicated server to Azure. When migrating the MSSQL database, I first tried importing .bacpac files, but some of the tables failed because of their size. I then used the SQL Database Migration Wizard, which worked fine for small tables but failed for tables with BLOB fields. For these tables I had to use temporary intermediate tables. Once all the data was transferred, setting up the web app was a breeze and we went into production. At first, everything seemed to work just fine, but after a couple of hours, when the load got heavier, all kinds of errors occurred. I went into the Azure portal and it was really easy to see the

ease of scaling mongodb vs mysql

I am creating a Grails application that is the backend for a mobile application. It is currently deployed on Amazon EC2. It persists data to a mysql database. One instance currently points to the database. I plan to deploy multiple instances of the app behind a load balancer and eventually have read requests go to slave instances of the db. We plan to release in the coming months and have a beta group of a couple of thousand users. It is more read intensive than write intensive.
We have looked into using mongodb instead of sql and see it as a good solution.
Not having a lot of experience scaling mysql (or mongodb), would it be easier to scale mongodb since it has features such as auto sharding? (Looking for thoughts from people who have done both.) I am of thinking it will be easier to switch to mongodb now rather than be in 'production' and having to migrate.
Thoughts?
MongoDB has two versions of "scaling":
Read scaling via replica sets.
Write scaling via sharding.
They're not silver bullets, but they're both very easy to set up. Replica sets have auto-failover, which is practically essential when using EC2 (EC2 has a history of nodes failing at random). When you need write scaling, MongoDB has documented processes for upgrading your replica set to a series of sharded replica sets.
The unfortunate limitation is that (last I checked), things like scalr don't really support automatic scaling. So you'll have to roll your own solution for adding and removing nodes from the set.
Some important considerations:
Disk IO performance is sketchy in the cloud. Good performance is all about the amount of RAM you can throw at the problem.
If you're using replica sets for reads, ensure that your driver / data wrapper is capable of handling the distribution of reads. Just as with MySQL, this is not currently "free": your code will need to decide "write vs. read" for each operation.
64-bit machines. MongoDB really wants to operate on 64-bit hardware. This is a cost consideration as you'll probably have to ramp up with 4GB machines instead of 2GB machines (I don't think this is a big limitation, but I also know what it's like to be a startup).
MongoDB is still new tech. The lists are very active, and people are using it in production for very large data sets. But this is still a new product, you have to be prepared to work from the command-line and parse through docs and ask questions.
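The "write vs. read" decision mentioned in the considerations above can be sketched as a small router. This is an assumption-laden illustration, not how any particular driver works: the class name `ReadWriteRouter` and the node labels are hypothetical, and real drivers usually expose this as a read-preference setting instead.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._replicas = itertools.cycle(replicas or [primary])

    def node_for(self, operation):
        if operation == "write":
            return self.primary
        if operation == "read":
            return next(self._replicas)  # simple round-robin
        raise ValueError("operation must be 'read' or 'write'")
```

Because replication is asynchronous, a read routed to a replica may not yet see a write that just went to the primary, so reads that must be current should still go to the primary.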
would it be easier to scale mongodb
At some level scaling is going to be a "hard" problem. What MongoDB does well is provide a way to really scale out lots of boxes horizontally with replication. In my experience, MySQL really tops out at around two boxes for writes. You can easily configure co-masters, but after that you have to start mucking around with all kinds of partitioning and you basically lose the ability to do joins.
I am of thinking it will be easier to switch to mongodb now rather than be in 'production'
It probably will.
Thoughts?
Start small. Get one piece working and see if you like how it works. If you have access to an EC2 account, then it's easy to spin up a couple of machines and play. MongoDB is not a panacea, but it works really well for a lot of modern web problems. Just measure how badly you need joins :)

Are there any stable and production quality nosql datastores?

Are there any production-quality NoSQL stores that I can use on a production system? I have looked at cassandra, tokyodb, couchdb etc., but none of them seem to be ready for deployment in production-like environments. I am talking thousands of requests per minute and lots of reads/writes/updates. My only concern is speed and service times. Does anybody know of production systems that use NoSQL stores effectively? Does anybody know of a NoSQL store that is backed by a big enterprise like Google/Yahoo/IBM?
Cassandra handles thousands of requests (including write-mostly workloads) per second, per machine, and its scaling-by-adding-machines has been there since day 1.
Here is a thread about Cassandra use in production and in-production-soon at dozens of companies: http://n2.nabble.com/Cassandra-users-survey-td4040068.html#a4040068
We're also adding more docs all the time, like http://wiki.apache.org/cassandra/Operations.
I think the NoSQL systems are an excellent choice if you 'only' care about speed and service time (and not, or less, about things like consistency and transactions). Facebook uses Cassandra.
"Cassandra is used in Facebook as an email search system containing 25TB and over 100m mailboxes." http://highscalability.com/product-facebooks-cassandra-massive-distributed-store
I think CouchDB isn't really speedy; maybe you can use MongoDB: http://www.mongodb.org/display/DOCS/Production+Deployments
Also worth considering is using a traditional RDBMS like MySQL to store schema-less data. This method gives you the stability of a proven database server like MySQL with the flexibility of a NoSQL solution.
Check out this blog posting on how FriendFeed does this.
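A rough sketch of the general idea, storing schema-less documents in a relational database, is shown below. It is only loosely in the spirit of that approach and makes several assumptions: `sqlite3` in memory stands in for MySQL, documents are serialized as JSON, and the table and function names (`entities`, `index_user_id`, `save`, `find_by_user`) are hypothetical.

```python
import json
import sqlite3
import uuid


def open_store():
    conn = sqlite3.connect(":memory:")
    # One generic table holds every document as an opaque JSON body.
    conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    # A separate table per queryable field acts as the index.
    conn.execute(
        "CREATE TABLE index_user_id (user_id TEXT NOT NULL, entity_id TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_user ON index_user_id (user_id)")
    return conn


def save(conn, entity):
    """Insert or replace a document and refresh its index row."""
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO entities VALUES (?, ?)",
            (entity_id, json.dumps(entity)),
        )
        conn.execute("DELETE FROM index_user_id WHERE entity_id = ?", (entity_id,))
        if "user_id" in entity:
            conn.execute(
                "INSERT INTO index_user_id VALUES (?, ?)",
                (str(entity["user_id"]), entity_id),
            )
    return entity_id


def find_by_user(conn, user_id):
    """Look documents up through the index table, then decode the bodies."""
    rows = conn.execute(
        "SELECT e.body FROM entities e "
        "JOIN index_user_id i ON i.entity_id = e.id "
        "WHERE i.user_id = ? ORDER BY e.id",
        (str(user_id),),
    )
    return [json.loads(body) for (body,) in rows]
```

New fields can be added to documents without any schema change; only fields that need to be queried get their own index table.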
BerkeleyDB is backed by Oracle
Using the native C interface one can reach close to 1 million read requests per second.
By the way, when you say thousands of requests per minute, any 'normal' DB should be able to handle that easily too.
Redis is worth a try, as GitHub uses Redis to manage a heavy queue of background jobs.
My first instinct would be BerkeleyDB, with each application node on a SAMBA network to facilitate ACID conformance & network use. It also sports a SQLite interface. Another poster cites MemcacheDB as also having BDB inside.
Another unique option would be OrientDB, which also has a SQL interface and lots of network & cluster features.
