Cache and Paginate RSS Feed in Rails - ruby-on-rails

I'm working on a life-streaming type app in Rails and am constantly parsing several different RSS feeds using Feedzirra. To let the app scale, I need some kind of caching that still lets me paginate the cached results in the view. The cache can expire as infrequently as once a day.
Being a novice to caching in Rails, what types of caching would be recommended for this? Also, I'm doing most of the feed parsing operations in modules in my lib/ directory, and I'm not sure whether this would have an effect on caching or be less than ideal for it. Ideally I'd like to cache the array of results the RSS feed returns and do some post-processing on it before I send it to the view.
Any help would be much appreciated!

I suggest using a gem to run a scheduled task from cron that collects the desired results from all your RSS feeds and saves them to an XML file or a database table.
On subsequent requests, load the results from that file or table and generate static cached pages (HTML files).
Every time the scheduled task runs, delete the previously saved files so that stale results are not displayed.
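A minimal sketch of that idea in plain Ruby, assuming the scheduled task hands over an array of entry hashes. The FeedSnapshot class, its method names, and the :published_at key are hypothetical stand-ins for whatever table or file the task actually writes to:

```ruby
# Hypothetical holder for the results saved by the scheduled task.
class FeedSnapshot
  MAX_AGE = 24 * 60 * 60 # seconds; the question allows expiry once a day

  attr_reader :entries, :fetched_at

  def initialize
    @entries = []
    @fetched_at = nil
  end

  # Called by the scheduled task: replaces the old results wholesale,
  # newest entry first.
  def replace(entries, now: Time.now)
    @entries = entries.sort_by { |entry| entry[:published_at] }.reverse
    @fetched_at = now
  end

  # True when nothing has been stored yet or the stored results are too old.
  def stale?(now: Time.now)
    @fetched_at.nil? || now - @fetched_at > MAX_AGE
  end

  # Returns the entries for a 1-based page number.
  def page(number, per_page: 10)
    number = [number.to_i, 1].max
    @entries.slice((number - 1) * per_page, per_page) || []
  end
end

snapshot = FeedSnapshot.new
snapshot.replace((1..25).map { |i| { title: "Entry #{i}", published_at: Time.at(i) } })
first_page = snapshot.page(1) # ten newest entries
last_page  = snapshot.page(3) # the remaining five
```

Post-processing can happen inside replace, so the view only ever sees the already processed, already sliced page.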

Related

Rails: Cache multiple result sets during a session

I have a Rails app that searches a static set of documents, and I need to figure out the best way to cache result sets. I can't just use the native ActiveRecord caching.
My situation:
I'm using the will_paginate gem, and at the moment, the query is running every time the user changes pages. Some queries are too complex for this to be responsive, so I need to cache the results, at least during an individual session. Also, the user may have multiple tabs open, running separate queries simultaneously. The document set is static; the size of the set is on the order of tens of thousands of documents.
Why straight-forward ActiveRecord caching won't work:
The user can search the contents of the documents, or search based on metadata restrictions (like a date range), or both. The metadata for each document is stored in an ActiveRecord, so those criteria are applied with an ActiveRecord query.
But if they add a search term for the document content, I run that search using a separate FastCGI application, because I'm doing some specialized search logic. So, I pass the term & the winnowed-down document list to the FastCGI application, which responds with the final result list. Then I do another ActiveRecord query: where("id IN (?)", returnedIds)
By the way, it's these FastCGI searches that are sometimes complex enough to be unresponsive.
My thoughts:
There's the obvious-to-a-newbie approach: I can use the metadata restrictions plus the search term as a key; they're already stored in a hash. They'd be paired up with the returnedIds array. And this guide at RubyOnRails.org mentions the cache stores that are available. But it's not clear to me which store is best, and I'm also assuming there's a gem that's better for this.
I found the gem memcached, but it's not clear to me whether it would work for caching the results of my FastCGI request.
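The "criteria hash as key" approach described above could be sketched as follows. This is only an illustration: search_cache_key, cached_result_ids and the SEARCH_RESULTS hash are hypothetical names, and the in-process Hash stands in for whichever shared cache store ends up being used:

```ruby
require "digest"
require "json"

# Builds a stable key: the same criteria in a different order give the same key.
def search_cache_key(criteria)
  normalized = criteria.map { |key, value| [key.to_s, value.to_s] }.sort
  "search/" + Digest::SHA256.hexdigest(JSON.generate(normalized))
end

# Hypothetical in-process store; a shared store would replace this Hash.
SEARCH_RESULTS = {}

# Returns the cached id list for these criteria, running the block
# (the expensive search) only on a cache miss.
def cached_result_ids(criteria)
  key = search_cache_key(criteria)
  SEARCH_RESULTS.fetch(key) { SEARCH_RESULTS[key] = yield }
end

calls = 0
criteria = { term: "contract", from: "2010-01-01", to: "2010-12-31" }
ids   = cached_result_ids(criteria) { calls += 1; [3, 17, 42] }
again = cached_result_ids(criteria.to_a.reverse.to_h) { calls += 1; [] }
# The second call reuses the stored ids; the block is not run again.
```

Because the key depends only on the criteria and not on the session, several tabs running different searches each get their own entry, and paging through one result set never repeats the slow search.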

Heroku Request Timeout (H12) with Big Data

I have a Ruby on Rails application which gets lots of data from social media sites like Twitter, Facebook etc.
There is an index page that shows records as paged. I am using Kaminari for paging.
My issue is big data, I guess. Let's say I have millions of records and want to show them on my index page with Kaminari. When I open the page in a browser, Heroku gives me an H12 error (request timeout).
What can I do to improve my app's performance? My idea is to fetch only the records that will be shown on the index page. Likewise, when the link to the second page is clicked, only the second page's records would be fetched from the database. That's the basic idea, but I don't know where to start or how to implement it.
Here is an example piece of code from my controller:
@ca_responses = @ca_responses_for_adaptors.where(:ca_request_id => @conditions)
  .order(sort_column + " " + sort_direction)
  .page(params[:page]).per(5)
@ca_responses: My records
@ca_responses_for_adaptor: Records based on adaptor. Think of it as the admin view: this returns all of the records.
@conditions: Getting specified adaptor records. For example, getting only Twitter-related records etc.
You could start by creating a page cache table that is filled with the data for your search results. That could be one approach.
There could be a few downsides, but if I knew the exact problem, I could propose a better solution. I doubt that you will be listing a million users on one page and then paging through them, or am I mistaken?
EDIT:
There could be a few problems with pagination. The first is that pagination gems work like this: they fetch all the data, and then when you click on a page number only the next 5 elements (or however many you have set) are taken from the whole list. The problem here is fetching all the data before paginating. If you have a million records, this could take a while for every page. You could define a new method that runs an SQL query selecting a limited amount of data from the database, with an offset so that only the data for that page is fetched. In that case the pagination gem is useless, so you would need to remove it.
The second option is to use something like a user_cache table. By this I mean creating a new table that holds just a few records: the records that will be displayed on the screen. The table will be smaller than the usual user table, so it will be faster to search through.
There could be other, more advanced solutions, but I doubt you could (or would want to) use them in your application.
Kaminari already paginates your records as expected.
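To make that concrete: the page and per scopes add a LIMIT and an OFFSET to the query, so only one page of rows is requested from the database. A minimal sketch of the arithmetic, where page_window is a hypothetical helper and the ca_responses table name is assumed from the variable names in the question:

```ruby
# Returns [limit, offset] for a 1-based page number.
# A missing or invalid page (nil, "abc", 0) falls back to page 1.
def page_window(page, per_page)
  page = [page.to_i, 1].max
  [per_page, (page - 1) * per_page]
end

limit, offset = page_window(2, 5)
# Roughly the query that page 2 with 5 records per page corresponds to:
sql = "SELECT * FROM ca_responses ORDER BY id LIMIT #{limit} OFFSET #{offset}"
```

If a request for 5 rows is still slow, the time is more likely going into the WHERE and ORDER BY clauses (for example a sort on an unindexed column) or into code around the query than into pagination itself.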
Heroku is prone to random timeout errors due to its random router.
Try to reproduce the problem locally. You may have bottlenecks in your code that make the request take too long to return. You should not have any problem requesting 5 items from the database, so there may be code before or after that query that takes a long time to run.
If everything is fine locally with production data, you can add new_relic to analyze your requests and see whether the problem occurs specifically in production (and why).
If it appears heroku router is indeed the problem, you can still try to use unicorn as a webserver, but you have to take special care that your app does not consume too much memory (each unicorn worker will consume the ram of a whole app, and you may hit heroku memory limits, which would produce R14 errors in place of those H12).

Storing Media RSS and iTunes podcast RSS feeds in the database

I want to be able to store media RSS and iTunes podcast RSS feeds in the database. The requirement here is that I don't want to miss out on ANY element or its attributes in the feed. It would make sense to find the most common elements in the feed and store them in the database as separate columns. The catch is that there can be feed-specific elements that are not standard. I want to capture them too. Since I don't know what they can be, I won't have a dedicated column for them.
Currently I have 2 tables called feeds and feed_entries. For RSS 2.0 tags like enclosures, categories, I have separate tables that have associations with feeds/feed_entries. I am using feedzirra for parsing the feeds. Feedzirra requires us to know the elements in the feed we want to parse and hence we would not know if feed contains elements beyond what feedzirra can understand.
What would be the best way to go about storing these feeds in the database without missing a single bit of information? (Dumping the whole feed into the database as is won't work, as we want to query most of the attributes.) What parser would be the best fit? Feedzirra was chosen for performance; however, getting all the data in the feed into the database is the priority.
Update
I'm using MySQL as the database.
I modeled my database on feeds and entries also, and cross-mapped the fields for RSS, RDF and Atom, so I could capture the required data fields as a starting point. Then I added a few others for tagging and my own internal-summarizations of the feed, plus some housekeeping and maintenance fields.
If you move from Feedzirra I'd recommend temporarily storing the actual feed XML in a staging table so you can post-process it using Nokogiri at your leisure. That way your HTTP process isn't bogged down processing the text, it's just retrieving content and filing it away, and updating the records for the processing time so you know when to check again. The post process can extract the feed information you want from the stored XML to store in the database, then delete the record. That means there's one process pulling in feeds periodically as quickly as it can, and another that basically runs in the background chugging away.
Also, both Typhoeus/Hydra and HTTPClient can handle multiple HTTP requests nicely and are easy to set up.
Store the XML as a CLOB. Most databases have XML processing extensions that allow you to include XPath-type queries as part of a SELECT statement.
Otherwise, if your DBMS does not support XML querying, use your language's XPath implementation to query the CLOB. You will probably need to extract certain elements into table columns for speedy querying.
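One way to combine the two answers is to walk every child element of an entry, put the ones you know into columns, and keep everything else (name, text, attributes) as generic rows. The sketch below uses REXML, which ships with Ruby, rather than Nokogiri; split_entry, KNOWN_COLUMNS and the sample feed are hypothetical and only illustrate the shape of the approach:

```ruby
require "rexml/document"

# Elements assumed to have dedicated columns; adjust to your schema.
KNOWN_COLUMNS = %w[title link guid pubDate].freeze

# Splits one <item> into known columns and a list of everything else,
# so that no element or attribute is silently dropped.
def split_entry(item)
  columns = {}
  extras = []
  item.elements.each do |element|
    name = element.expanded_name # keeps the prefix, e.g. "itunes:duration"
    if KNOWN_COLUMNS.include?(name)
      columns[name] = element.text
    else
      attributes = {}
      element.attributes.each { |key, value| attributes[key] = value }
      extras << { name: name, text: element.text, attributes: attributes }
    end
  end
  { columns: columns, extras: extras }
end

xml = <<~XML
  <rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
    <channel>
      <item>
        <title>Episode 1</title>
        <itunes:duration>12:34</itunes:duration>
        <enclosure url="http://example.com/ep1.mp3" type="audio/mpeg"/>
      </item>
    </channel>
  </rss>
XML

item  = REXML::XPath.first(REXML::Document.new(xml), "//item")
entry = split_entry(item)
```

The extras could go into a single key/value table associated with feed_entries, which keeps unknown elements queryable without needing a column for each.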

Need caching techniques for REST service calls

I am building a Ruby on Rails application where I need to be able to consume a REST API to fetch some data in (Atom) feed format. The REST API has a limit to number of calls made per second as well as per day. And considering the amount of traffic my application may have, I would easily be exceeding the limit.
The solution to that would be to cache the REST API response feed locally and expose a local service (Sinatra) that provides the cached feed as it is received from the REST API. And of course a sweeper would periodically refresh the cached feed.
There are 2 problems here.
1) One of the REST APIs is a search API where search results are returned as an ATOM feed. The API takes in several parameters including the search query. What should be my caching strategy so that cached feed can be uniquely identified against the parameters? That is, for example, if I search for say
/search?q=Obama&page=3&per_page=25&api_version=4
and I get a feed response for these parameters. How do I cache the feed so that for the exact same parameters passed in a call some time later, the cached feed is returned and if the parameters change, a new call should be made to the REST API?
2) The other problem is regarding the sweeper. I don't want to sweep a cached feed that is rarely used. That is, the search query Best burgers in Somalia would obviously be in far less demand than, say, Barack Obama. I do have data on how many consumers have subscribed to each feed. The strategy should be to sweep the cached feeds based on how large the number of subscribers to the search query is. Since the caching needs to happen in the Sinatra application, how would one go about implementing this kind of sweeping strategy? Some code would help.
I am open to any ideas here. I want these mechanisms to be very good on performance. Ideally I would want to do this without database and by pure page caching. However, I am open to possibility of trying other things.
Why would you want to replicate the REST service as a Sinatra app? You could easily just make a model inside your existing Rails app to cache the Atom feeds (storing the whole feed as a string inside for example).
For example, a CachedFeed model that is refreshed once its "updated_at" is old enough to warrant renewal.
You could even use static caching for your CachedFeed controller to reduce the strain on your system.
Having the cache inside your Rails app would greatly reduce complexity in terms of when to renew your cache, or even in counting the requests performed against the REST API you query.
You could have model logic that distributes the calls you have available to the most popular feeds. The search parameters could simply be an attribute of your model, so you can easily find and distinguish the feeds.
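A rough sketch of such a model in plain Ruby, under stated assumptions: the class below is not an ActiveRecord model, and key_for, refresh_interval, stale? and the two interval constants are hypothetical names and values chosen only for illustration:

```ruby
# Plain-Ruby stand-in for a CachedFeed model row.
class CachedFeed
  MIN_INTERVAL = 5 * 60       # seconds; floor for the most popular feeds
  MAX_INTERVAL = 24 * 60 * 60 # seconds; ceiling for feeds with no subscribers

  attr_reader :key, :body, :subscribers, :updated_at

  # Normalizes request parameters so the same search always maps to the same key.
  # Values are not URL-encoded here; this is only a lookup key.
  def self.key_for(params)
    params.map { |name, value| [name.to_s, value.to_s] }
          .sort
          .map { |name, value| "#{name}=#{value}" }
          .join("&")
  end

  def initialize(params:, body:, subscribers:, updated_at:)
    @key = self.class.key_for(params)
    @body = body
    @subscribers = subscribers
    @updated_at = updated_at
  end

  # More subscribers means a shorter interval, clamped to the two limits.
  def refresh_interval
    (MAX_INTERVAL / (subscribers + 1)).clamp(MIN_INTERVAL, MAX_INTERVAL)
  end

  # True when the feed is old enough to be fetched from the REST API again.
  def stale?(now = Time.now)
    now - updated_at > refresh_interval
  end
end

params = { q: "Obama", page: 3, per_page: 25, api_version: 4 }
feed = CachedFeed.new(params: params, body: "<feed/>", subscribers: 1000,
                      updated_at: Time.now - 600)
feed.stale? # popular feed, ten minutes old: due for a refresh
```

A periodic job would then refresh only the feeds for which stale? is true, which spends the limited API calls on the feeds with the most subscribers.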

How to cache queries in Rails across multiple requests

I want to cache query results so that the same results are fetched "for more than one request" until I invalidate the cache. For instance, I want to render a sidebar which has all the pages of a book, much like the index of a book. As I want to show it on every page of the book, I have to load it on every request. I can cache the rendered sidebar index using action caching, but I also want to cache the query results that are used to generate the HTML for the sidebar. Does Rails provide a way to do it? How can I do it?
You can cache the query results using ActiveSupport's cache store, which can be backed by a memory store such as memcached, or a database store if you provide your own implementation. Note that you'll want to use a database store if the cache is shared across multiple Ruby processes, which will be the case if you're deploying to Mongrel Cluster or Phusion Passenger.
This Railscast has the details
You could also check for your action cache before querying the database in the controller.
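The pattern itself is "fetch, or compute and store, until explicitly invalidated". In Rails this is what Rails.cache.fetch with a block and Rails.cache.delete provide; the sketch below mimics that shape with a hypothetical in-process QueryCache class so that it runs without Rails:

```ruby
# Hypothetical stand-in with the same fetch/delete shape as a cache store.
# It lives in one process only, so it is not shared between app servers.
class QueryCache
  def initialize
    @store = {}
  end

  # Returns the stored value, or runs the block once and stores its result.
  def fetch(key)
    return @store[key] if @store.key?(key)
    @store[key] = yield
  end

  # Invalidates one entry; the next fetch runs the block again.
  def delete(key)
    @store.delete(key)
  end
end

cache = QueryCache.new
queries = 0
load_pages = lambda do
  queries += 1 # stands in for the real database query
  ["Preface", "Chapter 1", "Chapter 2"]
end

cache.fetch("book/1/pages") { load_pages.call } # runs the query
cache.fetch("book/1/pages") { load_pages.call } # served from the cache
cache.delete("book/1/pages")                    # e.g. after a page is added
cache.fetch("book/1/pages") { load_pages.call } # runs the query again
```

When doing this with ActiveRecord, it is worth caching the loaded records (for example by calling to_a inside the block) rather than a lazy relation, since otherwise the query would still run on every request.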
