I have a Rails app that searches a static set of documents, and I need to figure out the best way to cache result sets. I can't just use the native ActiveRecord caching.
My situation:
I'm using the will_paginate gem, and at the moment, the query is running every time the user changes pages. Some queries are too complex for this to be responsive, so I need to cache the results, at least during an individual session. Also, the user may have multiple tabs open, running separate queries simultaneously. The document set is static; the size of the set is on the order of tens of thousands of documents.
Why straightforward ActiveRecord caching won't work:
The user can search the contents of the documents, or search based on metadata restrictions (like a date range), or both. The metadata for each document is stored in an ActiveRecord model, so those criteria are applied with an ActiveRecord query.
But if they add a search term for the document content, I run that search using a separate FastCGI application, because I'm doing some specialized search logic. So I pass the term and the winnowed-down document list to the FastCGI application, which responds with the final result list. Then I do another ActiveRecord query: where("id IN (?)", returnedIds)
By the way, it's these FastCGI searches that are sometimes complex enough to be unresponsive.
My thoughts:
There's the obvious-to-a-newbie approach: I can use the metadata restrictions plus the search term as a key; they're already stored in a hash. They'd be paired up with the returnedIds array. And this guide at RubyOnRails.org mentions the cache stores that are available. But it's not clear to me which store is best, and I'm also assuming there's a gem that's better for this.
I found the gem memcached, but it's not clear to me whether it would work for caching the results of my FastCGI request.
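Roughly, I imagine something like the sketch below, using Rails.cache (which could be backed by memcached via a client gem). run_fastcgi_search stands in for my FastCGI call, and the model/filter names are just illustrative:

def cached_result_ids(metadata_filters, term)
  key = ['doc_search', term, metadata_filters.sort].join('/')
  Rails.cache.fetch(key, expires_in: 30.minutes) do
    candidate_ids = Document.where(metadata_filters).pluck(:id)
    run_fastcgi_search(term, candidate_ids)   # returns the final array of ids
  end
end

# Each page change then paginates the cached id list instead of re-running the search:
ids  = cached_result_ids(metadata_filters, params[:q])
docs = Document.where(id: ids).paginate(page: params[:page], per_page: 20)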
Related
I have an app that allows users to sort and filter through 30,000 items of data. Right now I make fetch requests from Redux actions to my rails API, with the queries being handled by scope methods on my rails end. My instructor is recommending that I move all my querying to my front-end for efficiency, but I'm wondering if it really will be more performant to manage a Redux state object with 30,000 objects in it, each with 50 of their own attributes.
(A couple extra notes: Right now I've only run the app locally and I'm doing the pagination server-side so it runs lightning fast, but I'm a bit nervous about when I launch it somewhere like Heroku. Also, I know that if I move my querying to the front-end I'll have more options to save the query state in the URL with react-router, but I've already sort of hacked a way around that with my existing set-up.)
Let's have a look at the pros and cons of each approach:
Querying on Front End
👍 Querying does not need another network request
👎 Network requests are slower because there is more data to send
👎 App must store much more data in memory
👎 Querying is not necessarily more efficient because the client has to do the filtering and it usually does not have the mechanisms to do so effectively (caching and indexing).
Querying on Back End
👍 Less data to send to client
👍 Querying can be quite fast if database indexes are set up properly
👍 App is more lightweight, it only holds the data it needs to display
👎 Each query will require a network request
The pros of querying on the Back End heavily outweigh those of the Front End, so I'd have to disagree with your instructor. Imagine if Google sent every relevant result to your browser and did the pagination and sorting there; your browser would feel extremely sluggish. With proper caching and database indexes on your data, the network requests will not be a huge disadvantage.
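For illustration, a minimal sketch of keeping the querying server-side (the model, column and pagination choices here are assumptions):

# app/models/item.rb: filtering stays in the database, where indexes can help
class Item < ActiveRecord::Base
  scope :in_category, ->(cat) { cat.present? ? where(category: cat) : all }
  scope :name_like,   ->(q)   { q.present? ? where('name LIKE ?', "%#{q}%") : all }
end

# app/controllers/items_controller.rb: only one page of data crosses the network
def index
  items = Item.in_category(params[:category])
              .name_like(params[:q])
              .order(:name)
              .paginate(page: params[:page], per_page: 50)   # will_paginate-style
  render json: items
end

# and in a migration, an index to back the filter:
# add_index :items, :category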
I was browsing reddit for the answer to this and came across this conversation which lists out a bunch of search gems for rails, which is cool. But what I wanted was something where I could:
Enter: OMG Happy Cats
It searches the whole database looking for anything that contains "OMG Happy Cats" and returns an array of model objects containing that value, which I can then run through Active Model Serializers (very important to be able to use this) to return a JSON object of search results, so you can display whatever you want to the user.
So that JSON object, if this were a blog, would have a post object, maybe a category object and even a comment object.
Everything I have seen is very specific to one controller and one model. Which is nice and all, but I'm more after "search for what you want, we'll return what you want", maybe growing smarter over time like the searchkick gem, which can also offer spelling suggestions.
I am building this with an API, so it would be limited to everything that belongs to a blog object (to keep the search from being too huge). It would search things like posts, tags, categories, comments and pages for your term, return a JSON object (as described), and boom, done.
Any ideas?
You'll be best served by considering the underlying technology for this:
--
Third Party
As far as I know (I'm not super experienced in this area), the best way to search an entire Rails database is to use a third party system to "index" the various elements of data you require, allowing you to search them as required.
Some examples of this include:
Sunspot / Solr
ElasticSearch
Essentially, having one of these "third party" search systems gives you the ability to index the various records you want in a separate database, which you can then search with your application.
--
Notes
There are several advantages to handling "search" with a third party stack.
Firstly, it takes the load off your main web server - which means it'll be more reliable & able to handle more traffic.
Secondly, it will ensure you're able to search all the data of your application, instead of tying into a particular model / data set.
Thirdly, because many of these third party solutions index the content you're looking for, it will free up your database connectivity for your actual application, making it more efficient & scalable.
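As a rough illustration, a multi-model search built on searchkick (which the question mentions) might look something like the sketch below; the model list and the use of ActiveModel::ArraySerializer are assumptions, and each model would need searchkick declared and its index built:

SEARCHABLE_MODELS = [Post, Category, Comment, Page]

def blog_search(term)
  SEARCHABLE_MODELS.each_with_object({}) do |model, results|
    records = model.search(term).to_a            # Elasticsearch-backed search
    key     = model.name.pluralize.downcase      # "posts", "categories", ...
    results[key] = ActiveModel::ArraySerializer.new(records).as_json
  end
end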
For PostgreSQL you should be able to use pg_search.
I've never used it myself but going by the documentation on GitHub, it should allow you to do:
# Search across every model that declares `multisearchable`
documents = PgSearch.multisearch('OMG Happy Cats').to_a
# Each pg_search document points back at its source record
objects = documents.map(&:searchable)
# Group the records by type, e.g. "posts", "categories", "comments"
groups = objects.group_by { |o| o.class.name.pluralize.downcase }
# Serialize each group and build the combined JSON response
json = Hash[groups.map { |k, v| [k, ActiveModel::ArraySerializer.new(v).as_json] }].as_json
puts json.to_json
We've built a dynamic questionnaire with an Angular front-end and RoR backend. Since there are a lot of dynamic parts to handle, it was impossible to utilise ActionView or jbuilder cache helpers. With each questionnaire request, quite a lot of queries have to be done, such as checking the validity of answers, checking dependencies, etc. Is there a recommended strategy to cache dynamic JSON responses?
To give an idea..
controller code:
def advance
  # Decrypt and parse parameters
  request = JSON.parse(decrypt(params[:request]))

  # Process passed parameters
  if request.key?('section_index')
    @result_set.start_section(request['section_index'].to_i)
  elsif request.key?('question_id')
    if valid_answer?(request['question_id'], request['answer_id'])
      @result_set.add_answer(request['question_id'],
                             request['answer_id'],
                             request['started_at'],
                             request['completed_at'])
    else
      return invalid_answer
    end
  end

  render_item(@result_set.next_item)
end
The next_item could be a question or section, but progress indicator data and possibly a previously given answer (navigation is possible) are returned as well. Also, data is sent encrypted from and to the front-end.
We've also built an admin area with an Angular front-end. In this area, results from the questionnaire can be viewed and compared. Quite a few queries are done to find subquestions, comparable questions, etc., which we found hard to cache. After clicking around with multiple simultaneous users, you could fill up the server's memory.
The app is deployed on Passenger and we've fine-tuned the config based on the server configuration. The results are stored in a Postgres database.
TLDR: In production, we found that memory usage becomes an issue. Some optimisations to queries (includes specifically) are possible, but is there a recommended strategy for caching dynamic JSON responses?
Without much detail as to how you are storing and retrieving your data, it is a bit tough. But what it sounds like you are saying is that your next_item method is CPU and memory intensive to try to find the next item. Is that correct? Assuming that, you might want to take a look at a Linked List. Each node (polymorphic) would have a link to the next node. You could implement it as a doubly linked list if you needed to step forward and backwards.
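A rough sketch of that linked-list idea (the model names here are assumptions, and on Rails 5+ the self-referential associations would need optional: true):

class QuestionnaireNode < ActiveRecord::Base
  belongs_to :item, polymorphic: true                       # a Question or a Section
  belongs_to :next_node,     class_name: 'QuestionnaireNode'
  belongs_to :previous_node, class_name: 'QuestionnaireNode'
end

# Advancing is then a couple of indexed lookups instead of recomputing
# the next item on every request:
current = QuestionnaireNode.find(session[:node_id])
render_item(current.next_node.item)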
How often does the data change? If you can cache big parts of it and you can find a trigger attribute (e.g. updated_at), you can do fragment caching in the view. Or, even better, HTTP caching in the controller. You can mix both.
It's a bit complex. Please have a look at http://www.xyzpub.com/en/ruby-on-rails/4.0/caching.html
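To give an idea of the HTTP caching part, a minimal sketch, assuming the result set exposes a usable updated_at (the names are illustrative):

def show
  result_set = ResultSet.find(params[:id])
  # stale? compares ETag / Last-Modified with the client's cached copy and
  # sends a 304 Not Modified (skipping the render) when nothing has changed
  if stale?(etag: result_set, last_modified: result_set.updated_at)
    render json: result_set.next_item
  end
end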
I want to be able to store media RSS and iTunes podcast RSS feeds in the database. The requirement here is that I don't want to miss out on ANY element or its attributes in the feed. It would make sense to find all the most common elements in the feed and store them in the database as separate columns. The catch here is that there can be feed-specific elements that may not be standard. I want to capture them too. Since I don't know what they can be, I won't have a dedicated column for them.
Currently I have 2 tables called feeds and feed_entries. For RSS 2.0 tags like enclosures and categories, I have separate tables that have associations with feeds/feed_entries. I am using Feedzirra for parsing the feeds. Feedzirra requires us to know the elements in the feed we want to parse, and hence we would not know if a feed contains elements beyond what Feedzirra can understand.
What would be the best way to go about storing these feeds in the database and not miss a single bit of information? (Dumping the whole feed into the database as-is won't work, as we want to query most of the attributes.) What parser would be the best fit? Feedzirra was chosen for performance; however, getting all the data in the feed into the database is a priority.
Update
I'm using MySQL as the database.
I modeled my database on feeds and entries also, and cross-mapped the fields for RSS, RDF and Atom, so I could capture the required data fields as a starting point. Then I added a few others for tagging and my own internal-summarizations of the feed, plus some housekeeping and maintenance fields.
If you move away from Feedzirra, I'd recommend temporarily storing the actual feed XML in a staging table so you can post-process it using Nokogiri at your leisure. That way your HTTP process isn't bogged down parsing the text; it just retrieves content, files it away, and updates the record with the fetch time so you know when to check again. The post-process can extract the feed information you want from the stored XML into the database, then delete the record. That means there's one process pulling in feeds periodically as quickly as it can, and another that basically runs in the background chugging away.
Also, both Typhoeus/Hydra and HTTPClient can handle multiple HTTP requests nicely and are easy to set up.
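A rough sketch of that pipeline, assuming a hypothetical RawFeed staging model with url, body, fetched_at and processed_at columns:

# Fetch step: just retrieve the content and file it away
response = Typhoeus.get(feed_url)
RawFeed.create!(url: feed_url, body: response.body, fetched_at: Time.now)

# Post-process step: parse the stored XML with Nokogiri in the background
RawFeed.where(processed_at: nil).find_each do |raw|
  doc = Nokogiri::XML(raw.body)
  doc.xpath('//item').each do |item|
    # walk every child element, standard or not, e.g. item.xpath('./*'),
    # and write the values into feeds/feed_entries and the related tables
  end
  raw.update!(processed_at: Time.now)
end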
Store the XML as a CLOB; most databases have XML processing extensions that allow you to include XPath-type queries as part of a SELECT statement.
Otherwise, if your DBMS does not support XML querying, use your language's XPath implementation to query the CLOB. You will probably need to extract certain elements into table columns for speedy querying.
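Since the question mentions MySQL: its ExtractValue() function takes an XPath expression right in SQL, so (assuming a raw_feeds staging table with the XML in a body column) something like this is possible:

titles = ActiveRecord::Base.connection.select_values(
  "SELECT ExtractValue(body, '/rss/channel/title') FROM raw_feeds"
)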
I am building a Ruby on Rails application where I need to be able to consume a REST API to fetch some data in (Atom) feed format. The REST API has a limit to number of calls made per second as well as per day. And considering the amount of traffic my application may have, I would easily be exceeding the limit.
The solution to that would be to cache the REST API response feed locally and expose a local service (Sinatra) that provides the cached feed as it is received from the REST API. And of course a sweeper would periodically refresh the cached feed.
There are 2 problems here.
1) One of the REST APIs is a search API where search results are returned as an Atom feed. The API takes several parameters, including the search query. What should my caching strategy be so that a cached feed can be uniquely identified by the parameters? That is, for example, if I search for say
/search?q=Obama&page=3&per_page=25&api_version=4
and I get a feed response for these parameters. How do I cache the feed so that for the exact same parameters passed in a call some time later, the cached feed is returned and if the parameters change, a new call should be made to the REST API?
2) The other problem is the sweeper. I don't want to sweep a cached feed that is rarely used. That is, the search query Best burgers in Somalia would obviously be far less in demand than, say, Barack Obama. I do have data on how many consumers have subscribed to each feed. The strategy here should be to sweep cached feeds based on how many subscribers their search queries have. Since the caching needs to happen in the Sinatra application, how would one go about implementing this kind of sweeping strategy? Some code would help.
I am open to any ideas here. I want these mechanisms to be very good on performance. Ideally I would want to do this without database and by pure page caching. However, I am open to possibility of trying other things.
Why would you want to replicate the REST service as a Sinatra app? You could easily just make a model inside your existing Rails app to cache the Atom feeds (storing the whole feed as a string inside for example).
For example, a CachedFeed model that gets refreshed whenever its updated_at is old enough for the feed to need renewing.
You could even use static caching for your CachedFeed controller to reduce the strain on your system.
Having the cache inside your Rails app would greatly reduce complexity in terms of when to renew your cache or even count the requests performed against the rest api you query.
You could have model logic to distribute the calls you have to the most popular feeds. The search parameters could just be attributes of your model, so you can easily find and distinguish them.
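A minimal sketch of what that CachedFeed model could look like (the names, the maximum age and call_rest_api are assumptions):

class CachedFeed < ActiveRecord::Base
  # columns: query_key (string, unique index), body (text),
  #          subscriber_count (integer), timestamps
  MAX_AGE = 1.hour

  def self.fetch(search_params)
    key  = search_params.sort.map { |k, v| "#{k}=#{v}" }.join('&')
    feed = find_or_initialize_by(query_key: key)
    if feed.new_record? || feed.updated_at < MAX_AGE.ago
      feed.body = call_rest_api(search_params)   # the rate-limited REST call
      feed.save!
    end
    feed.body
  end
end

The subscriber_count column is there so a refresh or sweep job could prioritise popular queries and simply let rarely used ones expire.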