Sharing data between Elastic Beanstalk web and worker tiers - ruby-on-rails

I have a platform (based on Rails 4/Postgres) running on an auto scaling Elastic Beanstalk web environment. I'm planning on offloading long running tasks (sync with 3rd parties, delivering email etc) to a Worker tier, which appears simple enough to get up and running.
However, I also want to run periodic batch processes. I've looked into using cron.yml, and the scheduling seems simple enough; however, the batch process I'm trying to build needs access to the web application's data in order to work.
Does anybody have an opinion on the best way of doing this? Either a shared RDS database between the web and worker tiers, or perhaps a web service that the worker tier can access?
Thanks,
Dan
Note: I've added an extra question, which more broadly describes my requirements, as it struck me that this might not be the best approach:
What's the best way to implement this shared batch process with Elastic Beanstalk?

Unless you need a full relational database management system (RDBMS), consider using S3 for shared persistent data storage across your instances.
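As a rough illustration of the S3 option (a sketch, assuming the aws-sdk-s3 gem; the bucket and key names are placeholders), both tiers can read and write the same objects:

```ruby
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-east-1")

# Web tier: write a shared artifact that the worker will pick up later.
csv_data = "id,name\n1,Example"
s3.put_object(bucket: "myapp-shared", key: "exports/accounts.csv", body: csv_data)

# Worker tier: read the same object when the batch process runs.
resp = s3.get_object(bucket: "myapp-shared", key: "exports/accounts.csv")
shared_csv = resp.body.read
```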
Also consider Amazon Simple Queue Service (SQS):
SQS is a fast, reliable, scalable, fully managed message queuing
service. SQS makes it simple and cost-effective to decouple the
components of a cloud application. You can use SQS to transmit any
volume of data, at any level of throughput, without losing messages or
requiring other services to be always available.
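For the SQS route, a minimal sketch (assuming the aws-sdk-sqs gem; the queue URL and message fields are placeholders) could look like this, with the web tier enqueuing work and the worker tier consuming it:

```ruby
require "aws-sdk-sqs"
require "json"

sqs = Aws::SQS::Client.new(region: "us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/batch-jobs"

# Web tier: describe the work to do and push it onto the queue.
sqs.send_message(
  queue_url: queue_url,
  message_body: { task: "sync_third_party", account_id: 42 }.to_json
)

# Worker side: long-poll for messages, process them, then delete them.
resp = sqs.receive_message(queue_url: queue_url, wait_time_seconds: 10, max_number_of_messages: 1)
resp.messages.each do |msg|
  payload = JSON.parse(msg.body)
  # ... run the batch/sync work described by `payload` ...
  sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
end
```

Note that on an Elastic Beanstalk worker environment the built-in daemon delivers queue messages to your application over HTTP POST, so in practice your worker code handles those POSTs; the polling half above is how you would consume the queue on a plain EC2 instance.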

Related

Using SQS for batch running

I am starting a new architecture to support an MDM (master data management) database. For the MDM database I will be using a graph database (Neo4j), so I need to integrate disparate sources into the MDM database. I will have sources such as orders, customer signups, likes from Facebook and so on.
So I imagine I can have one or multiple SQS queues for the sources, and each source will put messages on SQS.
Then I will have many worker nodes responsible for getting the messages from the queue and updating the MDM database.
For the moment I will use just one worker, because I don't know how Neo4j will perform with multiple writers.
I see many people using SQS to send messages that launch a batch process; it's like job queueing.
Is it a common use case to use SQS to send data messages? Or should I maybe be using another AWS component?
Using Neo4j with a queueing system is pretty common. I've never used SQS myself, but I'm aware of people using ActiveMQ (or others) to feed data into Neo4j.
This is mostly done to decouple the architecture and provide a way to gracefully deal with load peaks.
Depending on the kind of events you put onto the queue, it might be beneficial if the consumer is single-threaded, especially when the events write to the same part of the graph. Otherwise Neo4j's locking mechanism will prevent concurrent writes onto the same nodes/relationships.

Share sessions among different types of web servers

Some web services in my company are built with different web frameworks (Rails, Django, PHP).
What's the best practice for sharing session state, so that users won't have to log in again and again across the different servers?
I run my Rails apps in an AWS Auto Scaling group. That is, even though I'm browsing the same website, my next request may land on another server, and I have to log in again because that server doesn't have my session state.
I need some ideas, or keywords to search for, about this kind of issue.
Thanks in advance.
I can think of two ways to achieve this objective:
1) Implement a custom session-handling mechanism that uses database session management, i.e. all sessions are stored in a dedicated table in the database and are accessible to all the servers.
2) Have a Central Authentication Service (CAS) that acts as a proxy in front of all the other servers. This means the authentication step happens before requests reach the load balancer.
If you look around, option 1 might be recommended by many, but it may also be overkill, since you'll need custom session management in each of the servers. Your choice will probably depend on the specific objectives you want to achieve, the overall flexibility of the system architecture and the amount of time you have on your hands. The CAS might be a more straightforward way of solving the problem.
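For option 1 in a Rails app, a database-backed store can be wired up roughly like this (a sketch, assuming the activerecord-session_store gem; the session key is a placeholder):

```ruby
# Gemfile
gem "activerecord-session_store"

# Then generate and run the sessions table migration:
#   rails generate active_record:session_migration
#   rake db:migrate

# config/initializers/session_store.rb
# Every app server behind the load balancer reads and writes the same sessions table.
Rails.application.config.session_store :active_record_store, key: "_myapp_session"
```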
Storing user sessions in your application's database wouldn't be the recommended option for AWS.
The biggest problem with using a database is that you need to write a cleanup script that runs every so often to clear the table of expired user sessions. This is messy, creates more overhead, and puts more pressure on your DB.
However, if you do want to use an actual database for this, you should use a NoSQL database like DynamoDB. This will give you much better performance than a relational database, and it's probably more cost effective in terms of data transfer too. The biggest problem, though, is that you still need that annoying cleanup script. Note: the AWS SDK for PHP has built-in support for using DynamoDB as a session handler:
http://docs.aws.amazon.com/aws-sdk-php/v2/guide/feature-dynamodb-session-handler.html
The best but most costly solution is to use an ElastiCache cluster. You can set a TTL of your choice, which means you won't have to worry about cleanup scripts. ElastiCache will also get you much better performance than DynamoDB or any relational DB, as the data is stored in the RAM of the ElastiCache nodes. The main drawback of ElastiCache is that it currently can't scale dynamically, so if too many users log in at once and you don't have a big enough cluster already provisioned, things could get ugly.
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/WhatIs.html
But you can bet that all the biggest, baddest and best applications being hosted on AWS are either using Dynamo, ElastiCache or a custom NoSQL or Cache cluster solution for storing user sessions.
Note that both of these services are supported in all of the AWS SDKs:
https://aws.amazon.com/tools/
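As one concrete illustration of the ElastiCache approach in Rails (a sketch, assuming the dalli gem and a Memcached-backed cluster; the endpoint and session key below are placeholders):

```ruby
# Gemfile
gem "dalli"

# config/environments/production.rb
# Point the Rails cache at the ElastiCache Memcached endpoint.
config.cache_store = :dalli_store,
                     "my-sessions.abc123.cfg.use1.cache.amazonaws.com:11211",
                     { expires_in: 1.day }

# config/initializers/session_store.rb
# Sessions live in the cluster's RAM and expire on their own, so no cleanup
# script is needed and every app server behind the load balancer sees the same data.
Rails.application.config.session_store :cache_store,
  key: "_myapp_session", expire_after: 2.weeks
```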

Ruby on Rails on a few servers

I have a big application. One part of it is high-load processing of user files, and I've decided to dedicate a separate server to this. It will run Nginx for serving content, plus some (non-Rails) programs for processing the files.
I have two questions:
What is better to use on this server? (Rails, or something else such as Sinatra?)
If I use Rails, how do I deploy? I can't find any instructions. If I have one app and two servers, how do I deploy it and delegate tasks between them?
P.S. I need to authenticate users on both servers. In Rails I use Devise.
You can use Rails for this. If both servers will act as a web client to the end user then you'll need some sort of load balancer in front of the two servers. HAProxy does a great job on this.
As far as getting the two applications to communicate with each other, this will be less trivial than you may think. What you should do is use a locking mechanism when performing the tasks. Delayed_job by default locks a job in the queue so that other workers won't try to work on the same job. You can use callbacks from ActiveJob to notify the user via web sockets whenever their job is completed.
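A rough sketch of that pattern (assuming ActiveJob backed by Delayed::Job and Rails 5+'s Action Cable; the job, model, and stream names are illustrative, not from this answer):

```ruby
class ProcessUploadJob < ApplicationJob
  queue_as :default

  # Delayed::Job locks the row while a worker runs it, so two workers never
  # pick up the same job. When the work finishes, push a note over web sockets.
  after_perform do |job|
    upload = job.arguments.first
    ActionCable.server.broadcast("job_status_#{upload.user_id}",
                                 upload_id: upload.id, status: "completed")
  end

  def perform(upload)
    upload.process! # the long-running work happens on the worker machine
  end
end

# Anywhere in the app (web server or worker), enqueueing is the same call:
ProcessUploadJob.perform_later(upload)
```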
Anything that will take time or calling an external API should usually be placed into a background processing queue so that you're not holding up the user.
If you cannot spin up more than the two servers, you should make one of them the master, or at least define clear roles for the two servers. For example, one server may be your background-processing and memcache server while the other stores your database and handles your web sockets.
There are a lot of different ways of configuring the services and anything including and beyond what I've mentioned is opinionated.
Having separate servers for handling tasks is my preference, as it makes them easier to manage from a sysadmin perspective. For example, if we find that our web sockets server is hammered, we can simply spin up a few more web socket servers and throw them into a load balancer pool; the end user is not negatively impacted by the networking changes. Whereas, if your servers perform dual roles outside of your standard Rails installation, you may find yourself cloning and wasting resources. Each of my web servers usually also performs background tasks on low-to-intermediate priority queues, while a dedicated server is left for handling mission-critical jobs.

Amazon Web Service Micro Instance - Server Crash

I am currently using an AWS micro instance as a web server for a website that allows users to upload photos. Two questions:
1) When looking at my CloudWatch metrics, I have recently noticed CPU spikes, the website receives very little traffic at the moment, but becomes utterly unusable during these spikes. These spikes can last several hours and resetting the server does not eliminate the spikes.
2) Although seemingly unrelated, whenever I post a link to my website on Twitter, the server crashes (i.e., "Error establishing a database connection"). After restarting Apache and MySQL, the website returns to normal functionality.
My only guess would be that the issue is somehow the result of deficiencies of the micro instance. Unfortunately, when I upgraded to a small instance, the site was actually slower, due to the fact that micro instances can burst to two EC2 compute units while small instances have only one.
Any suggestions?
If you want to stay in the free tier of AWS (micro instance), you should offload as much as possible away from your EC2 instance.
I would suggest uploading the images directly to S3 instead of going through your web server (see an example here: http://aws.amazon.com/articles/1434).
S3 can also serve most of your static assets (images, JS, CSS...) instead of your weak web server, and you can put these S3 files behind an Amazon CloudFront (CDN) distribution to improve your application's performance.
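One common way to push uploads directly to S3 (a sketch, assuming the aws-sdk-s3 gem; the bucket and key are placeholders) is to hand the browser a short-lived presigned URL and let it PUT the file itself, bypassing your instance entirely:

```ruby
require "aws-sdk-s3"
require "securerandom"

bucket = Aws::S3::Resource.new(region: "us-east-1").bucket("myapp-uploads")
object = bucket.object("photos/#{SecureRandom.uuid}.jpg")

# Valid for 15 minutes; the client uploads straight to S3 with an HTTP PUT,
# so the micro instance never touches the file bytes.
upload_url = object.presigned_url(:put, expires_in: 900)
```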
Another service that can help you offload work is SQS (Simple Queue Service). Instead of handling requests from users online, you can send some requests (for example, "upload done") as messages to SQS and have your reader process these messages at its own pace. This is a good way to handle momentary load caused by several users working with your service simultaneously.
Another service is DynamoDB (a managed NoSQL DB service). You can move most of your current MySQL data and queries to DynamoDB. Amazon DynamoDB also has a free tier that you can enjoy.
With the combination of the above, you can have your micro instance handling the few remaining dynamic pages until you need to scale your service with your growing success.
Wait… I'm sorry. Did you say you were running both Apache and MySQL Server on a micro instance?
First of all, that's never a good idea. Secondly, as documented, micros have low I/O and can only burst to 2 ECUs.
If you want to continue using a resource-constrained micro instance, you need to (a) put MySQL somewhere else, and (b) use something like Nginx instead of Apache as it requires far fewer resources to run. Otherwise, you should seriously consider sizing up to something larger.
I had the same issue. As far as I understand, the problem is that AWS throttles you once you exceed a predefined usage: it allows a small burst, but after that things become horribly slow.
You can test this by logging in and doing something. If you use the CPU for a couple of seconds, the whole box becomes extremely slow, and then you have to wait, without doing anything at all, to get things back to "normal".
That was the main reason I went for a VPS instead of AWS.

Rails: Can I run background jobs on a different server?

Is it possible to host the application on one server and queue jobs on another server?
Possible examples:
Two different EC2 instances, one with the main server and the second with the queueing service.
Host the app in Heroku and use an EC2 instance with the queueing service
Is that possible?
Thanks
Yes, definitely. We have delayed_job set up that way where I work.
There are a couple of requirements for it to work:
The servers have to have synced clocks. This is usually not a problem, as long as the servers' timezones are all set to the same.
The servers all have to access the same database.
To do it, you simply have the same application on both (or all, if more than two) servers, and start workers on whichever server you want to process jobs. Either server can still queue jobs, but only the one(s) with workers running will actually process them.
For example, we have one interface server, a db server and several worker servers. The interface server serves the application via Apache/Passenger, connecting the Rails application to the db server. The workers have the same application, though Apache isn't running and you can't access the application through http. They do, on the other hand, have delayed_jobs workers running. In a common scenario, the interface server queues up jobs in the db, and the worker servers process them.
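In code, the split described above is invisible: the same application is deployed everywhere, enqueuing just writes a row to the shared delayed_jobs table, and only the worker boxes run the worker daemon. A hypothetical sketch, assuming the delayed_job gem (the job and model names are illustrative):

```ruby
# Defined in the application that ships to every server.
class ResizeProfilePictureJob < Struct.new(:user_id)
  def perform
    user = User.find(user_id)
    user.resize_profile_picture! # runs on whichever worker picks the job up
  end
end

# Interface server: queueing only inserts a row into the shared database.
Delayed::Job.enqueue(ResizeProfilePictureJob.new(current_user.id))

# Worker servers: start the daemons that poll that same table, e.g.
#   RAILS_ENV=production bin/delayed_job -n 2 start
```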
One word of caution: if you're relying on physical files in your application (attachments, log files, downloaded XML or anything else), you'll most likely need a solution like S3 to keep those files, because the individual servers might not have them locally. For example, if a user uploads their profile picture via your web-facing server, the file will likely be stored on that server; if another server then needs to resize the profile picture, the image won't exist on that worker server.
Just to offer another viable option: you can use a new worker service like IronWorker that relies completely on an elastic farm of cloud servers inside EC2.
This way you can queue/schedule jobs to run and they will parallelize across tons of threads spanning multiple servers - all without worrying about the infrastructure.
Same deal with the database though - it needs to be accessible from the outside.
Full disclosure: I helped build IW.
