Thinking Sphinx & Rails questions - ruby-on-rails

I'm building my first Rails app and have it working great with Thinking Sphinx. I'm understanding most of it, but would love it if someone could help me clarify a few conceptual questions:
When displaying search results after a Sphinx query, should I be using the sphinx_attributes that are returned from the query? Or should my view use normal Rails objects, such as @property.title, @property.amenities.title, etc.? If I use normal Rails objects, doesn't that mean it's doing extra queries?
In a forum, I'd like to display 'unread posts'. Obviously this is true/false for each user/topic combination, so I'm thinking I should be caching the 'reader' IDs within the topic's Sphinx index. This way I can quickly do a query for all unread posts for a given user_id. I've got this working, but then realised it's pointless, as there is a time delay between Sphinx indexes. So if a user clicks on an unread post, it will still appear unread until the Sphinx DB is re-indexed.
I'm still in development so I'm manually indexing/rebuilding, but in production, what is a standard interval between re-indexing?
I have a model with several text fields - should I concatenate these all into one column in the Sphinx index for a keyword search? Surely this is quicker than indexing all the separate fields.
Slightly off-topic, but just wondering - when you access nested models, for example @property.agents.name, does this affect performance? Or does Rails automatically fetch all associated entries when a property is pulled from the database?

To answer each of your points:
For both of your examples, sphinx_attributes would not be helpful. Firstly, you've already loaded the property, so the title is available directly without an extra database hit. And for property.amenities.title you're dealing with an array of strings, which Sphinx has no concept of. Generally, I would only use sphinx_attributes for complicated calculated attributes, not standard column references.
Yes, you're right, there will be a delay with this value.
It depends on how often your data changes. I have some apps where I can index every day because changes are so rare, but others where we'll run it every 10 minutes. If the data is particularly volatile, I'll look at using deltas (usually via Sidekiq) to have changes reflected in Sphinx in a few seconds.
I don't think it makes much difference either way - unless you want to search on any of those columns separately? If so, each will need to be a separate field.
By default, as you use each property's agents, the agents for that property will be loaded from the database (one SQL call per property). You could look at the eager loading docs for how to manage this better when you're dealing with multiple records. Thinking Sphinx has the ability to pass through :include options to the underlying ActiveRecord call.
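For example, here is a minimal sketch of passing eager loading through a search call (the model and association names come from the question; the exact option name varies between Thinking Sphinx versions - older versions accept :include at the top level, v3 nests it under :sql):
@properties = Property.search params[:query], :sql => { :include => :agents }
@properties.each do |property|
  property.title                 # no extra query - the record is already loaded
  property.agents.map(&:name)    # no extra query - agents were eager loaded
end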

Related

Faceting in Solr when index contains millions of documents

I'm working on a project that uses a Solr index with a few million documents and we've recently hit a memory problem. Faceting has become unusable on a couple of our fields - Solr runs out of heap memory - because of the number of documents containing those fields.
What options do we have besides increasing the memory? We see increasing the memory only as a temporary solution, because the number of documents goes up by a few hundred thousand per day.
I'm looking into SolrCloud at the minute, but I'm not sure it's the right solution.
Any suggestions?
Thanks!
FacetFields: Allow for facet counts based on distinct values in a field. There are two methods for FacetFields, one that performs well with few distinct values in a field, and the other for when a field contains many distinct values (generally, thousands and up – you should test what works best for you).
The first method, facet.method=enum, works by issuing a FacetQuery for every unique value in the field. As mentioned, this is an excellent method when the number of distinct values in a field is small. It requires excessive memory though, and breaks down when the number of distinct values gets large. When using this method, be careful to ensure that your FilterCache is large enough to contain at least one filter for every distinct value you plan on faceting on.
The second method uses the Lucene FieldCache (a future version of Solr will actually use a different non-inverted structure – the UnInvertedField). This method is actually slower and more memory intensive for fields with a low number of unique values, but if you have a lot of unique values, this is the way to go. This method uses the FieldCache to look up the values for the given field for each document, and every time a document with a given value is found, that value has its count incremented.
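As a rough illustration (the host, core, and field names below are placeholders), the method can be chosen per field at request time:
http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=category&f.category.facet.method=enum&facet.field=description&f.description.facet.method=fc
Here f.<field>.facet.method overrides the default for that one field, so a low-cardinality field can stay on enum (filterCache-backed) while a high-cardinality field uses fc (FieldCache-backed).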
Please check the memory allotted to each cache, and whether you can tweak the FieldCache settings to handle the situation. (As you have mentioned, type3 and type4 have a large number of documents.)
The source for the above information is Scaling Lucene and Solr. I found one more article which talks about Solr faceting: You are faceting it wrong.
Before SolrCloud, you could also consider running Solr with multiple cores.
On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores.
With SolrCloud, a single index can span multiple Solr instances.
This means that a single index can be made up of multiple SolrCores on different machines.
The SolrCores that make up one logical index are called a collection.
A collection is essentially a single index that spans many SolrCores, both for index scaling as well as redundancy.
If you wanted to move your 2 SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.
SolrCloud adds distributed capabilities to Solr.
With this enabled, you can have a highly available, fault-tolerant cluster of Solr servers.
Use SolrCloud when you want high scale, fault tolerant, distributed indexing and search capabilities.
You can get more info about SolrCloud here
https://cwiki.apache.org/confluence/display/solr/SolrCloud

Performing multiple queries on the same model efficiently

I've been going round in circles for a few days trying to solve a problem which I've also struggled with in the past. Essentially it's an issue of understanding the best (or an efficient) way to perform multiple queries on a model, as I'm regularly finding my pages are very slow to load.
Consider the situation where you have a model called Everything. Initially you perform a query which finds those records in Everything which match certain criteria:
@chosenrecords = Everything.where('name LIKE ?', 'What I want').order('price ASC')
I want to remember the contents of @chosenrecords as I will present them to the user as a list; however, I would also like to understand more about the attributes of @chosenrecords, for instance:
@minimumprice = @chosenrecords.first
@numberofrecords = @chosenrecords.count
When I use the above code in my controller and inspect the query log on the local server, I am surprised to find that each of the three statements issues an SQL query against the original Everything model, rather than remembering the records returned in @chosenrecords and working on those. This seems very inefficient to me, and indeed each of the three queries takes the same amount of time to process, making the page slow.
I am more experienced in writing code in software like MATLAB, where once you've calculated the value of a variable it is stored locally and can be quickly interrogated, rather than recalculated on each occasion you want to know more about it. Please could you guide me as to whether I am just on the wrong track completely and the issues I've identified are simply "how it is in Rails", or whether there is something I can do to improve it? I've looked into concepts like using a scope, defining a different variable type, and caching, but I'm not quite sure what I'm doing in each case and keep ending up in a similar hole.
Thanks for your time
You are partially on the wrong track. Rails 3 comes with Arel, which defers the query until the data is actually required. In your case, you have generated an Arel query but are executing it twice: once with .first and again with .count. What I have done below is run the query once, load all the results into an array, and work on that array in the next two lines.
Perform the queries like this:
@chosenrecords = Everything.where('name LIKE ?', 'What I want').order('price ASC').all
@minimumprice = @chosenrecords.first
@numberofrecords = @chosenrecords.size
It will solve your issue.
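A side note for anyone on Rails 4 or newer: there .all returns a lazy ActiveRecord::Relation rather than an array, so the equivalent is to force the query once with .to_a (or .load) and reuse the resulting array:
@chosenrecords = Everything.where('name LIKE ?', 'What I want').order('price ASC').to_a
@minimumprice = @chosenrecords.first       # works on the in-memory array, no SQL
@numberofrecords = @chosenrecords.size     # no extra SQL query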

Exporting and/or displaying multiple records in Rails

I have been working in Rails (I mean serious working) for the last 1.5 years now. Coming from a .NET background and database/OLAP development, there are many things I like about Rails, but there are a few things about it that just don't make sense to me. I just need some clarification for one such issue.
I have been working on an educational institute's admission process, which is just a small part of a much bigger application. For the administrator, we needed to display a list of all applied/enrolled students (which may range from 1,000 to 10,000), and also give a way to export them as an Excel file. For now, we are just focusing on exporting in CSV format.
My questions are:
Is Rails meant to display so many records at the same time?
Is will_paginate the only way to paginate records in Rails? From what I understand, it still fetches all the records from the DB and then selectively displays the relevant ones. Back in .NET/PHP/JSP, we used to create a stored procedure that selectively returned only the relevant records. Since using stored procedures is a known pain point in Rails, what other options do we have?
The same issue applies to exporting this data. I benchmarked the process, i.e. receiving the request at the server, executing the query, and returning the response. The ActiveRecord object creation was taking a huge amount of time. Why is that? There were only about 1,000 records, and the page showed a connection timeout to the user. I mean, if the connection times out for 1,000 records, why use Rails at all, or does it mean Rails is not meant for such applications? I have previously worked with TBs of data and never had this issue.
I never understood ORM techniques at the core. Say we have a users table that is associated with multiple other tables, but for displaying records we only need data from users and its associated admissions table. Does ActiveRecord actually create objects for all the associated tables? I know the data will be fetched only if we use the association, but does it create all the objects beforehand?
I hope, these questions are not independent and do qualify as per the guidelines of SF.
Thank you.
EDIT: Any help? I re-checked and benchmarked again: for 1,000 records, where we are joining 4-5 different tables (1,000 users, 2-3 one-to-one associations, and 2-3 one-to-many associations), it is creating more than 15,000 objects with eager loading. With lazy loading, it would be one query for the 1,000 users plus some 20+ queries. What are the possible options for such problems and applications? I know I am kinda bumping the question to come to the top again!
Rails can handle databases with TBs of data.
Is will_paginate the only way to paginate records in Rails?
There are many other gems like "kaminari".
It fetches all records from the DB...
No, it doesn't work that way. For example, take the following query: User.all.page(1).per(10)
User.all won't fire a DB query; it returns a proxy object (an ActiveRecord::Relation), and page(1) and per(10) are called on that proxy. Only when you try to access the data from the proxy object will a DB query be executed. ActiveRecord accumulates all the conditions and parameters you pass and executes a single SQL query when required.
Go to the Rails console and type u = User.all; "f" (the second statement, "f", is there to prevent the console from inspecting the proxy in order to display the result). This won't fire any query. Now try u[0]; that will fire a query.
ActiveRecord creation was taking a helluva time
1000 records shouldn't take much time.
Check the number of SQL queries fired against the DB. Look for signs of the N+1 problem and fix them with eager loading.
Check the serialization of the records to CSV format for any CPU- or memory-intensive operations.
Use a profiler and track down the function that is consuming most of the time.
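As a rough sketch of those last points (the User/Admission models come from the question; the name and status attributes are hypothetical), eager loading plus batching keeps both the query count and the memory footprint down when building the CSV:
require 'csv'

csv = CSV.generate do |rows|
  rows << ['Name', 'Admission status']
  # includes(:admissions) avoids one query per user (the N+1 problem);
  # find_each loads users in batches instead of all 10,000 at once
  User.includes(:admissions).find_each(:batch_size => 500) do |user|
    user.admissions.each do |admission|
      rows << [user.name, admission.status]   # hypothetical attributes
    end
  end
end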

DB-agnostic calculations: Is it good to store calculation results? If yes, what's the best way to do this?

I want to perform some simple calculations while staying database-agnostic in my rails app.
I have three models:
.---------------. .--------------. .---------------.
| ImpactSummary |<------| ImpactReport |<----------| ImpactAuction |
`---------------'1 *`--------------'1 *`---------------'
Basically:
ImpactAuction holds data about... auctions (prices, quantities and such).
ImpactReport holds monthly reports that have many auctions as well as other attributes; it also shows some calculation results based on the auctions.
ImpactSummary holds a collection of reports as well as some information about a specific year, and also shows calculation results based on the two other models.
What I intend to do is to store the results of these really simple calculations (just means, sums, and the like) in the relevant tables, so that reading them is fast, and in a way that lets me easily perform queries on the calculation results.
Is it good practice to store calculation results? I'm pretty sure that's not a very good thing, but is it acceptable?
Is it useful, or should I not bother and perform the calculations on the fly?
If it is good practice and useful, what's the best way to achieve what I want?
That's the tricky part. At first, I implemented a simple chain of callbacks that would update the calculation fields of the parent model upon save (that is, when an auction is created or updated, it calls some_attribute_will_change! on its report and saves it, which triggers the report's own callbacks, and so on).
This approach fits well when creating/updating a single record, but if I want to work on several records, it will trigger the calculations on the whole chain for each record... So I suddenly find myself forced to put a condition on the callbacks, depending on whether I have one or many records, and I can't figure out how (using a class method that could be called on a relation? using an instance attribute @skip_calculations on each record? just using an 'outdated' field to mark the parent records for later calculation?).
Any advice is welcome.
Bonus question: would it be considered DB-agnostic if I implement this with DB views?
As usual, it depends. If you can perform the calculations in the database, either using a view or using find_by_sql, I would do so. You'll save yourself a lot of trouble: stored summaries have to be kept up to date whenever the underlying values change, and you've already met that problem when updating multiple rows. Having a view, or a query that implements the view stored as text in ImpactReport, will allow you to always have fresh data.
The answer? Benchmark, benchmark, benchmark ;)
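For instance, here is a minimal sketch of the on-the-fly approach with find_by_sql (only the model names come from the question; the amount column and the with_totals name are hypothetical):
class ImpactReport < ActiveRecord::Base
  has_many :impact_auctions

  # compute the aggregates in the database instead of storing them;
  # the extra SELECT columns become readable attributes on each result
  def self.with_totals
    find_by_sql(<<-SQL)
      SELECT impact_reports.*,
             SUM(impact_auctions.amount) AS total_amount,
             AVG(impact_auctions.amount) AS average_amount
      FROM impact_reports
      LEFT JOIN impact_auctions
        ON impact_auctions.impact_report_id = impact_reports.id
      GROUP BY impact_reports.id
    SQL
  end
end

# usage: ImpactReport.with_totals.each { |report| report.total_amount }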

Using multiple key value stores

I am using Ruby on Rails and have a situation where I am wondering whether it is appropriate to use some sort of key-value store instead of MySQL. I have users that have_many lists and each list has_many words. Some lists have hundreds of words, and I want users to be able to copy a list. This is a heavy MySQL task because it is going to have to create these hundreds of word objects at one time.
As an alternative, I am considering using some sort of key-value store where the key would just be the word. A list of words could be stored in a text field in MySQL, or each list could be a new key-value DB? It seems like it would be faster to copy a key-value DB this way rather than having to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this with a relational database would be to have a lists table, a words table, and a lists-words join table relating the two. You are correct that there would be some overhead, but don't overestimate it; because the table structure is defined up front, there is very little actual storage overhead for each record, and records can be inserted very quickly.
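A minimal sketch of that layout (the model names and the name column are assumptions based on the question; the INSERT ... SELECT copy is one possible way to duplicate a list without instantiating hundreds of ActiveRecord objects):
class List < ActiveRecord::Base
  belongs_to :user
  has_many :list_words
  has_many :words, :through => :list_words
end

class Word < ActiveRecord::Base
  has_many :list_words
end

class ListWord < ActiveRecord::Base
  belongs_to :list
  belongs_to :word
end

# copying a list then only copies join rows, in a single SQL statement
def copy_list(list, new_owner)
  copy = List.create!(:name => list.name, :user => new_owner)
  List.connection.execute(<<-SQL)
    INSERT INTO list_words (list_id, word_id)
    SELECT #{copy.id}, word_id FROM list_words WHERE list_id = #{list.id}
  SQL
  copy
end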
If you want very fast copies, you could allow lists to be copied on write, meaning a single list could be referred to by multiple users, or multiple times by the same user, and you only actually duplicate the list when a user tries to add, remove, or change an entry. Of course, this is premature optimization; start simple and only add complications like this if you find they are necessary.
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason; it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined and each word has to be recorded separately in each list). The only dimension of performance I would expect to be better is if you need massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.
