Problems with bulk insertion and "bulk validation" in Rails

I'm using ar-extensions' import feature to do bulk import and it's quick, but not as quick as I'd like. Two problems I am seeing from the logs:
I still see individual SQL insert statements - why isn't it doing multirow insertion?
I have a :validates_uniqueness_of and I see that it does a SELECT for every row to do it. Is there a "bulk validation" way it could just select everything with a WHERE clause and validate the uniqueness that way instead?
I'm hesitant to drop down to SQL for doing this, so any suggestions - or using a different gem/plugin? Thanks!

I use an instance attribute (#bulk_loading) in my model for when I'm doing bulk insertions. If the flag is true, some of the validations are skipped (see the sketch after the list below).
As egarcia says, currently AR doesn't support multirow insertions.
validates_uniqueness_of validation speedups:
1) Create a unique index in the DBMS and do NOT use AR for the check - just catch the appropriate error from the DB driver when the insert fails because it violates the unique index.
2) Create a non-unique index in the DB for the uniqueness validation and check (using DB query-analysis techniques) that the index is being used and that the SQL uniqueness check is executing as fast as possible.
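A minimal sketch of both ideas, assuming a hypothetical bulk_loading flag and a unique index on a made-up users.email column (names are illustrative, not from the original post):

class User < ActiveRecord::Base
  attr_accessor :bulk_loading   # set by the bulk loader, not a DB column

  # Skip the per-row SELECT when bulk loading; rely on the DB's unique index instead.
  validates_uniqueness_of :email, :unless => :bulk_loading
end

user = User.new(:email => some_email)   # some_email is a hypothetical local variable
user.bulk_loading = true
begin
  user.save!
rescue ActiveRecord::RecordNotUnique
  # Raised by the adapter on a unique-index violation (Rails 3+; older versions
  # surface the driver's own error class instead).
end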

I use FasterCSV on a couple of projects and it seems fast enough. However, it does NOT do "multirow inserts". I don't think ActiveRecord is able to do that, especially if validations are involved.
Using your DB's native import mechanism will always be faster, but you will lose some of AR's goodness - validations and filters, mostly.
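If dropping down to SQL does turn out to be acceptable, one hedged option is a single hand-rolled multi-row INSERT through the connection; the table, columns and records array below are made up, and validations/callbacks are bypassed entirely:

conn = ActiveRecord::Base.connection
values = records.map do |r|              # records: hypothetical array of attribute hashes
  "(#{conn.quote(r[:name])}, #{conn.quote(r[:email])})"
end
conn.execute("INSERT INTO users (name, email) VALUES #{values.join(', ')}")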

Related

How to run complex queries in Tarantool

I've always worked with relational DBs and recently decided to migrate a performance-critical service from SQL Server to Tarantool, hoping to take advantage of its fast in-memory search and processing. I've got a couple of questions while planning the migration.
I've got a table with about one million records containing pricing information which means I'm dealing mostly with numbers and uuids. First, I need to run a select containing multiple conditions to get a subset of the data, like
SELECT * FROM rates WHERE SupplierId = #SupplierId AND ProductId = #ProductId AND (LocalDistributionZoneId = #LocalDistributionZoneId OR LocalDistributionZoneId IS NULL)
Q1: What is the strategy for running such a query in Lua? Do I create an index for each field in the predicate, or can I get along with one secondary composite index?
Q2: Will it be more convenient to run such a query in SQL (box.sql.execute) rather than in pure Lua? Will it be considerably slower than running the same query in pure Lua?
Q3: If I use SQL, is it possible to review the execution plan to make sure that the query I run really uses the indexes I've defined in the space?
OK, after I've got the results from the first query, I need to analyse the data and then, based on the results of the analysis, run one more query on the dataset returned by the first query.
Q4: Can Tarantool help me in dealing with the intermediate dataset? More specifically, may I somehow run more queries against the intermediate subset of tuples while leveraging the indexes created in the space? Or would I need to implement alternative strategies, like re-adding the interim results to a temporary space with pre-defined indexes and then doing another select, or implementing the further search myself?
Thank you!
Q1: Don't. Use SQL, it's faster: it doesn't create garbage-collected objects for intermediate execution results.
Q2: Yes, please use our SQL features for that.
Q3: Use the EXPLAIN statement.
Q4: I don't know what exactly you mean by "help". You could try whatever strategy works best: create a more complex query, save the original query in a view to use in the resulting query, or create a temporary table and work with it. To give more details, let's look at whether the execution plan Tarantool chooses is good enough, or whether you have to optimize it manually.

Effects of Rail's default_scope on performance

Can a default_scope that orders records by something other than ID significantly slow down a Rails application?
For example, I have a Rails (currently 3.1) app using PostgreSQL where nearly every Model has a default_scope ordering records by their name:
default_scope order('users.name')
Right now, because the default_scopes order records by name rather than by ID, I am worried I might be incurring a significant performance penalty when normal queries are run. For example with:
User.find(5563401)
or
User.where('created_at = ?', 2.weeks.ago)
or
User.some_scope_sorted_best_by_id.all
In the above examples, what performance penalty might I incur by having a default_scope by name on my Model? Should I be concerned about this default_scope affecting application performance?
Your question is missing the point. The default scope itself is just a few microseconds of Ruby execution to cause an order by clause to be added to every SQL statement sent to PostgreSQL.
So your question is really asking about the performance difference between unordered queries and ordered ones.
The PostgreSQL documentation is pretty explicit. Ordered queries on unindexed fields are much slower than unordered ones because (no surprise) PostgreSQL must sort the results before returning them, first creating a temporary table or index to hold the result. This could easily be a factor of 4 in query time, possibly much more.
If you introduce an index just to achieve quick ordering, you are still paying to maintain the index on every insert and update. And unless it's the primary index, sorted access still involves random seeks, which may actually be slower than creating a temporary table. This is also discussed in the Postgres docs.
In a nutshell, NEVER add an order clause to an SQL query that doesn't need it (unless you enjoy waiting for your database).
NB: I doubt a simple find() will have ORDER BY attached, because it must return exactly one result. You can verify this very quickly by starting rails console, issuing a find, and watching the generated SQL scroll by. However, the where and all queries definitely will be ordered, and consequently slower than they need to be.
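If the ordering does hurt on specific hot paths, a hedged workaround (standard ActiveRecord relation methods, though behaviour varies a little across Rails versions) is to strip or replace the default order per query:

User.unscoped.where('created_at > ?', 2.weeks.ago).all   # ignore the default_scope entirely
User.reorder('users.id').limit(100)                      # keep the scope but replace its ORDER BY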

What's the difference between “includes” and “preload” in an ActiveRecord query?

I'm struggling to find a comparison of includes() and preload() for ActiveRecord objects. Can anyone explain the difference?
Rails has 2 ways of avoiding the N+1 problem. One involves creating a big join-based query to pull in your associations; the other involves making a separate query per association.
When you do includes, Rails decides which strategy to use for you. It defaults to the separate-query approach (preloading) unless it thinks you are using columns from the associations in your conditions or order. Since that only works with the joins approach, it uses that instead.
Rails' heuristics sometimes get it wrong, or you may have a specific reason for preferring one approach over the other. preload (and its companion method eager_load) allow you to specify which strategy you want Rails to use.
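For illustration, a hedged sketch with a hypothetical posts association (the exact SQL and heuristics vary by Rails version; newer versions also want .references(:posts) for string conditions):

User.preload(:posts)      # always two queries: one for users, one for their posts
User.eager_load(:posts)   # always one big LEFT OUTER JOIN query
User.includes(:posts)     # Rails picks the strategy: preloading by default...
User.includes(:posts).where('posts.published = ?', true)
# ...but it switches to the JOIN strategy when it spots association columns in the conditions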
As apidock says for preload, "This method is deprecated or moved on the latest stable version. The last existing version (v3.0.9) is shown here." So the difference is simply that includes is NOT deprecated.

How to change model lazyloadness at runtime in Symfony?

I use sfPropelORMPlugin.
Lazy loading is fine if I operate on one object per web page, but if there are hundreds I get hundreds of separate DB queries. I'd like to completely disable lazy loading, or disable it for the needed columns on those particularly heavy pages, but I haven't found a way so far.
You should join all your relations when you build your query; that way you'll get all the data in a single query. Note that you have to use joinWithRelation(), where Relation is the related table name.
Elaborating on William Durand's answer, perhaps you should also look at the Propel function doSelectJoinAll(), which should pre-load all of the objects related to your relations. Just keep in mind this can be expensive in terms of memory.
Another technique is to create a custom criteria with your needed joins, then use a manual hydrate technique to add on to your base object. I do this often when the data I need uses aggregates or other columns that are not exactly mapped to objects. There are plenty of hydrate() examples around.
I added a utility method to the peer to be able to set which columns I want to load, using "pseudo columns" for this type of DB query. I also overrode hydrate() to understand this "markup". Everything was fine until I found out that even though the data is hydrated, symfony won't understand it and won't let you use it as intended.
PS: join was never considered an option because the site is under fairly heavy load.

How to make an Active Record to Sequel transition

I'm using Rails 3 (RC) and Active Record 3 (with meta_where) and just started to switch to Sequel, because it seems to be much faster and offers some really great features.
I'm already using the active_model plugin (and some others).
As far as I know, I should use User[params[:id]] instead of User.find(params[:id]). But this doesn't raise if no record exists, and it doesn't convert the value to an integer (the type of the PK), so it ends up as a string in the WHERE clause. I'm not sure if this causes any performance issues. Does this harm the identity_map? What's the best way to solve both of these issues?
Is there an easy way to flip the usage of associations like User.messages_dataset and User.messages, so that User.messages behaves like an Active Record association (i.e. returns what User.messages_dataset does now)? I guess I'd use the #..._dataset form a lot but almost never need the array method, because I could just append .all?
I noticed that the same (complex) queries are sometimes executed several times within one action. Is there something like the Active Record query cache? (identity_map doesn't seem to help in these cases.)
Is there a to_sql I can call to get the raw SQL a dataset would produce?
You can either use:
User[params[:id].to_i] || raise(Sequel::Error)
Or write your own method that does something like that. Sequel supports non-integer primary keys, so it wouldn't do the conversion automatically. It shouldn't have any effect on an identity map. Note that Sequel doesn't use an identity map unless you are using the identity_map plugin. I guess the best way is to write your own helper method.
Not really. You can use the association_proxies plugin so that non-array methods are sent to the dataset instead of the array of objects. In general, you shouldn't be using the association dataset method much. If you are using it a lot, it's a sign that you should have an association for that specific usage.
There is no query cache, and there never will be one. You should write your actions so that the results of the first query are cached and reused.
Dataset#sql gives you the SELECT SQL for the dataset.
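Putting those pieces together, a hedged sketch (the users table is made up, with_pk! and Sequel::NoMatchingRow exist in reasonably recent Sequel versions, and Sequel.sqlite needs the sqlite3 gem):

require 'sequel'

DB = Sequel.sqlite                           # in-memory example database
DB.create_table(:users) { primary_key :id; String :name }

class User < Sequel::Model; end
User.create(:name => 'alice')

user = User.with_pk!(1)                      # raises Sequel::NoMatchingRow when absent
user = User[1] || raise(Sequel::Error, 'record not found')   # the pattern from the answer above

puts User.where(:name => 'alice').sql        # the SELECT statement as a string; nothing is executed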
