Which web service to use as a cache storage? - ruby-on-rails

I am creating an application, as practice, which will scrape a website. I want to cache the scraped results using a web service, but I am not sure which one to use.
I have looked into Amazon's Elasticache and S3.
ElastiCache seems like overkill for this problem, but it uses Redis under the hood, which should reduce my workload (I guess?).
S3 is not in-memory, but the bigger issue for me is that I am not completely sure it is a good solution for this kind of problem.
I don't need anything super fancy. I would like something easy to set up, yet efficient if that is possible.
So which one should I choose? Are there any better alternatives?

Why do you think that Redis will reduce your workload? Redis is a really powerful cache and will store your data reliably. AWS ElastiCache can run Redis, should be really easy to set up, and you only need to control what you save there.
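If you go that route, pointing Rails.cache at the ElastiCache endpoint is basically a one-liner. A minimal sketch, assuming the redis-rails gem and a placeholder endpoint:
# config/environments/production.rb -- sketch only; the endpoint is a placeholder
config.cache_store = :redis_store,
  "redis://your-cluster.abc123.0001.use1.cache.amazonaws.com:6379/0/cache",
  { expires_in: 12.hours }
# Rails.cache.fetch / Rails.cache.write will then go through Redis on ElastiCache.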

Related

Share sessions among different types of web servers

Some web services in my company are built with different web frameworks (Rails, Django, PHP).
What is the best practice for sharing session state,
so that users won't have to log in again and again across different servers?
I run my Rails apps in an AWS auto scaling group.
That is, even when I browse the same website, my next request may be served by another server, so I have to log in again because that server doesn't have my session state.
I need some ideas or keywords to search for regarding this kind of issue.
Thanks in advance
I can think of two ways in which you can achieve this objective:
Implement a custom session handling mechanism that uses database session management, i.e. all sessions are stored in a dedicated table in the database and are accessible to all the servers (see the sketch after this answer).
Have a Central Authentication Service (CAS) which acts as a proxy in front of all the other servers. This means authentication happens before the requests reach the load balancer.
If you look around, option 1 might be recommended by many, but it may also be overkill, since you'll need custom session management in each of the servers. Your choice will probably depend on the specific objectives you want to achieve, the overall flexibility of your system architecture, and the amount of time you have on your hands. The CAS might be the more straightforward way of solving the problem.
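For option 1 in a Rails app, the database-backed store is mostly configuration. A minimal sketch, assuming the activerecord-session_store gem (on older Rails the store is built in) and a placeholder cookie key:
# config/initializers/session_store.rb -- sketch only
Rails.application.config.session_store :active_record_store, key: '_shared_session'
# Generate the sessions table once; every app server then reads/writes the same table:
#   rails generate active_record:session_migration && rake db:migrate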
Storing user sessions in your application's database wouldn't be the recommended option on AWS.
The biggest problem with using a database is that you need to write a clean-up script that runs every so often to clear the table of expired user sessions. This is messy, creates more overhead, and puts more pressure on your DB.
However, if you do want to use an actual database for this, you should use a NoSQL database like DynamoDB. It will give you much better performance than a relational database, and it's probably more cost-effective in terms of data transfer too. The biggest problem is that you still need that annoying clean-up script. Note: there is built-in support in the PHP SDK for storing user sessions in DynamoDB:
http://docs.aws.amazon.com/aws-sdk-php/v2/guide/feature-dynamodb-session-handler.html
The best but most costly solution is to use an ElastiCache cluster. You can set a TTL of your choice, which means you won't have to worry about clean-up scripts. ElastiCache will also give you much better performance than DynamoDB or any relational DB, as the data is stored in the RAM of the ElastiCache nodes. The main drawback of ElastiCache is that, currently, it can't scale dynamically, so if too many users log in at once and you don't have a big enough cluster already provisioned, things could get ugly.
http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/WhatIs.html
But you can bet that the biggest, baddest and best applications hosted on AWS are using DynamoDB, ElastiCache, or a custom NoSQL or cache cluster solution for storing user sessions.
Note that both of these services are supported in all of the AWS SDKs:
https://aws.amazon.com/tools/
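If you do go the ElastiCache route from Rails, a session store with a TTL takes care of expiry for you. A minimal sketch, assuming the redis-rails gem and a placeholder ElastiCache endpoint:
# config/initializers/session_store.rb -- sketch only; the endpoint is a placeholder
Rails.application.config.session_store :redis_store,
  servers: ["redis://your-cluster.abc123.0001.use1.cache.amazonaws.com:6379/0/session"],
  expire_after: 90.minutes   # Redis drops expired sessions itself; no clean-up script needed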

Does anyone know if it's possible to integrate carrierwave backgrounder (store_in_background) with Heroku?

https://github.com/lardawge/carrierwave_backgrounder
I would like to use the store_in_background method to delay storing files to S3, but I'm a little bit afraid, as Heroku is a read-only system. Has anyone managed to do this?
It should work if you're using Heroku's newer stack, which offers an ephemeral (writable) file system. I'd recommend something like queue_classic instead of carrierwave_backgrounder.
queue_classic uses Postgres-specific features to deliver great performance. It also has the advantage that it can be driven by Postgres triggers/procedures, which allows you to queue an image delete in the same query that deletes the image row.
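Enqueueing from your app is a one-liner. A rough sketch, assuming queue_classic is already set up and ImageStore.persist is a hypothetical method you would write yourself:
# Defer the slow S3 upload instead of doing it in the request cycle.
# QC.enqueue takes a "Class.method" string plus JSON-serializable arguments.
QC.enqueue("ImageStore.persist", image.id)
# A worker process (e.g. `rake qc:work` in a Heroku worker dyno) picks the job up.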

How to scale with Rails

I would like to prepare a Rails application to be scalable. Among other things, this app connects to some APIs and sends emails; I'm using PostgreSQL and it's hosted on Heroku.
Now that the code is clean, I would like to add caching and any other technique that will help the app scale.
Should I use Redis or Memcached? It's a little obscure to me, and I've seen similar questions on Stack Overflow, but here I would like to know which one I should use purely for scaling purposes.
Also, I was thinking of using Sidekiq to process some jobs. Is it going to conflict with Memcached/Redis? And in which cases should I use it?
Are there any other things I should think of in terms of scalability?
Many thanks
Redis is a very good choice for caching; its performance is similar to memcached's (Redis is slightly faster), and it takes only a few minutes to configure it that way.
If possible, I would suggest against using the same Redis instance to store both the cache and the message store.
If you really need to do that, make sure you configure Redis with the volatile-lru maxmemory policy and that you always set your cache entries with a TTL; this way, when Redis runs out of memory, cache keys will be evicted.
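A minimal sketch of that setup; the maxmemory value and the fetch_rates_from_api helper are placeholders:
# redis.conf (or your hosted Redis settings) -- only keys with a TTL get evicted:
#   maxmemory 256mb
#   maxmemory-policy volatile-lru
# In the app, always cache with an explicit TTL so every cache key is evictable:
Rails.cache.fetch("api/exchange_rates", expires_in: 1.hour) do
  fetch_rates_from_api   # hypothetical slow API call
end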
Sidekiq requires Redis as its message store, so you will need to have a Redis instance or use a Redis service if you want to use Sidekiq. Sidekiq is great, btw.
You can use either Memcached or Redis as your cache store. For caching I'd probably use Memcached, as its cache-cleaning behavior is better suited to this. In Rails in general, and in Rails 4 apps in particular, one rarely explicitly expires an item from the cache or sets an explicit expiration time. Instead, one depends on updates to the cache_key, which means the cached item isn't actually deleted from the store. Memcached handles this pretty well by evicting the least recently used items when it hits its memory limit.
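To make the cache_key point concrete, here is a small sketch; build_dashboard is a hypothetical expensive computation:
# The key includes user.cache_key (which embeds updated_at), so when the record
# changes a new key is used and the stale entry is simply left for memcached
# to LRU-evict; nothing is ever deleted explicitly.
Rails.cache.fetch([user, "dashboard"]) do
  build_dashboard(user)
end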
That all said, scaling is a lot more than picking a few components. You'll have to do a fair amount of planning, identify bottlenecks as the app develops, and scale CPU/memory/disk by increasing servers as needed.
You should also look at two cool Redis features that are coming in the near future:
Redis Cluster:
http://redis.io/topics/cluster-spec
http://redis.io/presentation/Redis_Cluster.pdf
Redis Sentinel (high availability):
http://redis.io/topics/sentinel
http://redis.io/topics/sentinel-spec
These features will be a great help in scaling Redis.
Memcached is missing these features, and as far as active development is concerned, the Redis community is far more vibrant.

What is the best approach to handle large file uploads in a rails app?

I am interested in understanding the different approaches to handling large file uploads (2-5 GB files) in a Rails application.
I understand that in order to transfer a file of this size it will need to be broken down into smaller parts. I have done some research, and here is what I have so far:
Server-side config will be required to accept large POST requests, and probably a 64-bit machine to handle anything over 4 GB.
AWS supports multipart upload.
HTML5 FileSystemAPI has a persistent uploader that uploads the file in chunks.
A library for BitTorrent, although this requires a transmission client, which is not ideal.
Can all of these methods be resumed like FTP? The reason I don't want to use FTP is that I want to keep this within the web app, if possible. I have used carrierwave and paperclip, but I am looking for something that can be resumed, as uploading a 5 GB file could take some time!
Of the approaches I have listed, I would like to understand what has worked well, and whether there are other approaches I may be missing. No plugins if possible; I would rather not use Java applets or Flash. Another concern is that some of these solutions hold the file in memory while uploading; that is also a constraint I would rather avoid if possible.
I've dealt with this issue on several sites, using a few of the techniques you've illustrated above and a few that you haven't. The good news is that it is actually pretty realistic to allow massive uploads.
A lot of this depends on what you actually plan to do with the file after you have uploaded it... The more work you have to do on the file, the closer you are going to want it to your server. If you need to do immediate processing on the upload, you probably want to do a pure rails solution. If you don't need to do any processing, or it is not time-critical, you can start to consider "hybrid" solutions...
Believe it or not, I've actually had pretty good luck just using mod_porter. Mod_porter makes apache do a bunch of the work that your app would normally do. It helps not tie up a thread and a bunch of memory during the upload. It results in a file local to your app, for easy processing. If you pay attention to the way you are processing the uploaded files (think streams), you can make the whole process use very little memory, even for what would traditionally be fairly expensive operations. This approach requires very little actual setup to your app to get working, and no real modification to your code, but it does require a particular environment (apache server), as well as the ability to configure it.
I've also had good luck using jQuery-File-Upload, which supports good stuff like chunked and resumable uploads. Without something like mod_porter, this can still tie up an entire thread of execution during upload, but it should be decent on memory, if done right. This also results in a file that is "close" and, as a result, easy to process. This approach will require adjustments to your view layer to implement, and will not work in all browsers.
You mentioned FTP and BitTorrent as possible options. These are not as bad as you might think, as you can still get the files pretty close to the server. They are not even mutually exclusive, which is nice, because (as you pointed out) they do require an additional client that may or may not be present on the uploading machine. The way this works is, basically, you set up an area for them to dump to that is visible to your app. Then, if you need to do any processing, you run a cron job (or whatever) to monitor that location for uploads and trigger your server's processing method. This does not get you the immediate response the methods above can provide, but you can set the interval small enough to get pretty close. The only real advantage of this method is that the protocols used are better suited to transferring large files; in my experience, though, the additional client requirement and the fragmented process usually outweigh any benefit.
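A rough sketch of that watcher, as a rake task run from cron; the paths and the Video model are placeholders:
# lib/tasks/uploads.rake -- sketch of the "watch a drop folder" approach
require "fileutils"

namespace :uploads do
  task scan: :environment do
    Dir.glob("/srv/ftp_dropbox/*").each do |path|
      next if File.mtime(path) > 1.minute.ago   # skip files still being written
      Video.create!(source_path: path)          # hypothetical model that kicks off processing
      FileUtils.mv(path, "/srv/ftp_processed/")
    end
  end
end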
If you don't need any processing at all, your best bet may be to simply go straight to S3 with them. This solution falls down the second you actually need to do anything with the files other than serve them as static assets...
I do not have any experience using the HTML5 FileSystemAPI in a rails app, so I can't speak to that point, although it seems that it would significantly limit the clients you are able to support.
Unfortunately, there is not one real silver bullet - all of these options need to be weighed against your environment in the context of what you are trying to accomplish. You may not be able to configure your web server or permanently write to your local file system, for example. For what it's worth, I think jQuery-File-Upload is probably your best bet in most environments, as it only really requires modification to your application, so you could move an implementation to another environment most easily.
This project is a new protocol over HTTP to support resumable uploads of large files. It bypasses Rails by providing its own server.
http://tus.io/
http://www.jedi.be/blog/2009/04/10/rails-and-large-large-file-uploads-looking-at-the-alternatives/ has some good comparisons of the options, including some outside of Rails.
Please go through it; it was helpful in my case.
Another site worth a look is:
http://bclennox.com/extremely-large-file-uploads-with-nginx-passenger-rails-and-jquery
Please let me know if any of this does not work out.
I would bypass the Rails server and post your large files (split into chunks) directly from the browser to Amazon Simple Storage Service. Take a look at this post on splitting files with JavaScript. I'm a little curious how performant this setup would be, and I feel like tinkering with it this weekend.
I think that Brad Werth nailed the answer.
Just one approach could be to upload directly to S3 (and even if you do need some reprocessing afterwards, you could theoretically use AWS Lambda to notify your app... but to be honest I'm just guessing here; I'm about to solve the same problem myself, and I'll expand on this later).
http://aws.amazon.com/articles/1434
If you use carrierwave:
https://github.com/dwilkie/carrierwave_direct_example
Uploading large files on Heroku with Carrierwave
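If you roll it yourself rather than using carrierwave_direct, the Rails side mostly just signs the upload. A minimal sketch with the aws-sdk-s3 gem; the bucket name, region and key prefix are placeholders:
# app/controllers/uploads_controller.rb -- sketch only
bucket = Aws::S3::Resource.new(region: "us-east-1").bucket("mybucket")
post   = bucket.presigned_post(key: "uploads/${filename}", success_action_status: "201")
# Hand post.url and post.fields to the browser; the JS uploader POSTs the file
# straight to S3, never touching the Rails dyno.
render json: { url: post.url, fields: post.fields }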
Let me also pin down a few options that might help others looking for a real-world solution.
I have a Rails 6 app on Ruby 2.7, and the main purpose of this app is to create a Google Drive-like environment where users can upload images and videos, which are then processed again into high quality.
Obviously we tried local processing using Sidekiq background jobs, but it was overwhelming for large uploads of 1 GB and more.
We tried tus.io, but personally I think it is not quite as easy to set up as jQuery-File-Upload.
So we experimented with AWS, moving through the steps listed below, and it worked like a charm: uploading directly to S3 from the browser.
Using a React dropzone uploader, we upload multiple files to S3.
We set up an AWS Lambda on an input bucket, triggered by all types of object creation on that bucket.
This Lambda converts the file, uploads the reprocessed version to a second, output bucket, and notifies us via AWS SNS so we can keep track of what worked and what failed.
On the Rails side, we just dynamically use the new output bucket and then serve it through an AWS CloudFront distribution.
You may check the AWS notes on MediaConvert for a step-by-step guide; they also have well-written GitHub repos for all sorts of experimentation.
So, from the user's point of view, they upload one large file (with Transfer Acceleration enabled on S3), the React library shows the upload progress, and once the file is uploaded, a Rails callback API verifies its existence in the S3 bucket under a key like mybucket/user_id/file_uploaded_slug and then confirms it to the user with a simple flash message.
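A rough sketch of that verification callback with the aws-sdk-s3 gem; the bucket name and key pattern are placeholders taken from the example above:
# Rails callback action -- sketch only
s3     = Aws::S3::Resource.new(region: "us-east-1")
object = s3.bucket("mybucket").object("#{current_user.id}/#{params[:file_uploaded_slug]}")
if object.exists?
  flash[:notice] = "Upload received and queued for processing."
else
  flash[:alert] = "We couldn't find your upload yet, please retry."
end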
You can also configure Lambda to notify the end user of a successful upload/encoding, if needed.
Refer this documentation - https://github.com/mike1011/aws-media-services-vod-automation/tree/master/MediaConvert-WorkflowWatchFolderAndNotification
Hope it helps someone here.

Redis and Memcache or just Redis?

I'm using memcached for some caching in my Rails 3 app through the simple Rails.cache interface and now I'd like to do some background job processing with redis and resque.
I think they're different enough to warrant using both. On heroku though, there are separate fees to use both memcached and redis. Does it make sense to use both or should I migrate to just using redis?
I like using memcached for caching because least recently used keys automatically get pushed out of the cache and I don't need the cache data to persist. Redis is mostly new to me, but I understand that it's persistent by default and that keys do not expire out of the cache automatically.
EDIT: Just wanted to be more clear with my question. I know it's feasible to use only Redis instead of both. I guess I just want to know if there are any specific disadvantages in doing so? Considering both implementation and infrastructure, are there any reasons why I shouldn't just use Redis? (I.e., is memcached faster for simple caching?) I haven't found anything definitive either way.
Assuming that migrating from memcached to redis for the caching you already do is easy enough, I'd go with redis only to keep things simple.
In redis persistence is optional, so you can use it much like memcached if that is what you want. You may even find that making your cache persistent is useful to avoid lots of cache misses after a restart. Expiry is available also - the algorithm is a bit different from memcached, but not enough to matter for most purposes - see http://redis.io/commands/expire for details.
I'm the author of redis-store; there is no need to use Redis commands directly, just use the :expires_in option like this:
ActionController::Base.cache_store = :redis_store, :expires_in => 5.minutes
The advantage of using Redis is speed, and with my gem you already have stores for Rack::Cache, Rails.cache and I18n.
I've seen a few large rails sites that use both Memcached and Redis. Memcached is used for ephemeral things that are nice to keep hot in memory but can be lost/regenerated if needed, and Redis for persistent storage. Both are used to take a load off the main DB for reading/write heavy operations.
More details:
Memcached: used for page/fragment/response caching, and it's OK to hit the memory limit on Memcached because it will use LRU (least recently used) eviction to expire the old stuff and keep frequently accessed keys hot in memory. It's important that anything in Memcached can be recreated from the DB if needed (it's not your only copy). But you can keep dumping things into it, and Memcached will figure out which entries are used most frequently and keep those hot in memory. You don't have to worry about removing things from Memcached.
Redis: you use this for data that you would not want to lose and that is small enough to fit in memory. This usually includes resque/sidekiq jobs, counters for rate limiting, split test results, or anything that you wouldn't want to lose or have to recreate. You don't want to exceed the memory limit here, so you have to be a little more careful about what you store and clean up later.
Redis starts to suffer performance problems once it exceeds its memory limit (correct me if I'm wrong). It's possible to solve this by configuring Redis to act like Memcached and LRU-expire things, so it never reaches its memory limit. But you would not want to do this with everything you are keeping in Redis, like resque jobs. So instead, people often keep the default Rails.cache set to use Memcached (using the dalli gem) and then keep a separate $redis = ... global variable for Redis operations.
# in config/application.rb
config.cache_store = :dalli_store # memcached
# in config/initializers/redis.rb
$redis = Redis.new(url: ENV['REDIS_URL'])
There might be an easy way to do this all in Redis - perhaps by having two separate Redis instances, one with an LRU hard memory limit, similar to Memcache, and another for persistent storage? I haven't seen this used, but I'm guessing it would be doable.
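For what it's worth, a sketch of what that two-instance setup could look like; the URLs and the memory limit are placeholders:
# Cache-only instance's redis.conf -- evict anything once memory is full, no persistence:
#   maxmemory 512mb
#   maxmemory-policy allkeys-lru
#   save ""
# config/initializers/redis.rb
$redis       = Redis.new(url: ENV['REDIS_URL'])        # persistent: jobs, counters
$redis_cache = Redis.new(url: ENV['REDIS_CACHE_URL'])  # LRU cache, safe to lose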
I would consider checking out my answer on this subject:
Rails and caching, is it easy to switch between memcache and redis?
Essentially, from my experience, I would advocate keeping them separate: memcached for caching and Redis for data structures and more persistent storage.
I asked the team at Redis Labs (who provide the Memcached Cloud and Redis Cloud add-ons) which product they would recommend for Rails caching. They said that in general they would recommend Redis Cloud, that Memcached Cloud is mainly offered for legacy purposes, and pointed out that their Memcached Cloud service is in fact built on top of Redis Cloud.
I don't know what you're using them for, but actually using both may give you a performance advantage: Memcached has far better performance running across multiple cores than Redis, so caching the most important data with Memcached and keeping the rest in Redis, taking advantage of its capabilities as database, could increase performance.
