I'm trying to use the closure tree gem for modelling some (ordered) nested data.
The issue I'm having is that when I go to insert records into the database (MySQL), it takes about 7 seconds to insert the 200 children (well, 400 inserts).
I'm about to go down the route of doing a straight bulk insert / raw SQL in order to speed things up, though this means making sure I get the hierarchy rows etc. correct.
If anyone has a strategy out there for doing bulk inserts of children with closure_tree I'd love to see it.
My call to closure_tree is:
has_closure_tree order: 'position'
I have also tried setting ActiveRecord::Base.connection.execute "set autocommit = 0;" (makes no difference) and turning off the advisory lock (also makes no difference).
[edit] Also tried wrapping the code that adds the children in a transaction; no joy either.
[edit] have opened an issue (which I hate doing, but I'm hoping there's a strategy I can follow for this)
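[edit] For reference, the bulk-insert strategy I have in mind is roughly the sketch below. The model name (Node), the children_attrs variable and the column names are placeholders, and it assumes closure_tree's rebuild! can regenerate the hierarchy table after the raw inserts.

# Build plain attribute hashes so closure_tree's per-row callbacks are skipped.
rows = children_attrs.each_with_index.map do |attrs, i|
  { name: attrs[:name], parent_id: parent.id, position: i }
end

Node.transaction do
  Node.insert_all(rows)  # Rails 6+; the activerecord-import gem works on older versions
end

# Regenerate the *_hierarchies (closure) table from parent_id in one pass.
Node.rebuild!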
My program deals with a deeply nested object.
Here is an illustration of the nested model:
ParentObject HasMany ChildObject1 (~30 records)
ChildObject1 HasMany ChildObject2 (~40 records)
ChildObject2 HasMany ChildObject3 (~15 records)
ChildObject2 HasMany ChildObject4 (~10 records)
To keep the app efficient, I decided to split the forms used to record this data (one form per ChildObject1). I also use caching, and therefore need to update the ChildObject1 'updated_at' field every time a ChildObject2, 3 or 4 is updated. For this reason, every child object's 'belongs_to' association has the 'touch' option set to true.
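For illustration, the associations described above look roughly like this (class and association names follow the illustration; the exact code is not reproduced here):

# touch: true propagates updated_at changes up to the parent,
# so the parent's cache key is invalidated on every child update.
class ChildObject2 < ActiveRecord::Base
  belongs_to :child_object1, touch: true
  has_many :child_object3s
  has_many :child_object4s
end

class ChildObject3 < ActiveRecord::Base
  belongs_to :child_object2, touch: true
end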
With a small server, performance is not so bad (at most 1 s to save the data).
But once everything is recorded, I also need to duplicate the parentObject with all its childObjects.
Duplicating it and building an identical parentObject is no problem, but when I save the object, the transaction is very long.
I looked at the server log and saw that the objects are inserted one by one. I also saw that after each insert, the parent's 'updated_at' field is updated (due to the 'touch: true' option).
It results in about 30,000 inserts (roughly 30 + 30×40 + 30×40×15 + 30×40×10 ≈ 31,000 rows) plus 60,000 updates, i.e. 90,000 write queries to the database (and each object has 3 to 6 fields...)!
Normally, the 'save' method natively uses ActiveRecord::Base.transaction.
Here that doesn't seem to happen.
I tried removing the 'touch: true' option; it's exactly the same, the inserts are still done one by one.
So my questions are :
I thought that transactions could be applied to nested objects as explained here. Am I misunderstanding something?
Is it an example of what shouldn't be done through ActiveRecord ?
Is it possible to do only one final update of the parent objects with the 'touch: true' option? (SOLVED: SEE ANSWER BELOW)
Is writing 90,000 rows to the database at once normally that much work? Or are the Puma server or the PostgreSQL database simply badly configured?
Thanks in advance for your help. If there's no solution, I will schedule this work to run overnight...
I solved the first part of the problem with https://github.com/godaddy/activerecord-delay_touching
This gem delays the 'touch' updates until the end of the batch. It's much cleaner this way!
But I still have problems with the transactions. I still don't know if I can insert all the data in one single query for each table.
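For the record, the 'one single query per table' duplication I have in mind is a sketch along these lines, using the activerecord-import gem; the model and column names are assumptions based on the illustration above, not my real code:

ActiveRecord::Base.transaction do
  copy = original.dup
  copy.save!

  child1_copies = original.child_object1s.map do |c1|
    c1.dup.tap { |d| d.parent_object_id = copy.id }
  end
  # One multi-row INSERT for all ChildObject1 rows, skipping per-row callbacks.
  ChildObject1.import(child1_copies, validate: false)

  # The same pattern repeats one level down: reload the imported ChildObject1
  # rows to get their new ids, build every ChildObject2 copy, import them in a
  # single statement, and so on for ChildObject3/ChildObject4.

  # Touch the parent once at the end instead of once per inserted row.
  copy.touch
end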
Can a default_scope that orders records by something other than ID significantly slow down a Rails application?
For example, I have a Rails (currently 3.1) app using PostgreSQL where nearly every Model has a default_scope ordering records by their name:
default_scope order('users.name')
Right now, because the default_scope orders records by name rather than by ID, I am worried I might be incurring a significant performance penalty when normal queries are run. For example with:
User.find(5563401)
or
User.where('created_at = ?', 2.weeks.ago)
or
User.some_scope_sorted_best_by_id.all
In the above examples, what performance penalty might I incur by having a default_scope by name on my Model? Should I be concerned about this default_scope affecting application performance?
Your question is missing the point. The default scope itself is just a few microseconds of Ruby execution to cause an order by clause to be added to every SQL statement sent to PostgreSQL.
So your question is really asking about the performance difference between unordered queries and ordered ones.
The PostgreSQL documentation is pretty explicit: ordered queries on unindexed fields are much slower than unordered ones because (no surprise) PostgreSQL must sort the results before returning them, first creating a temporary table or index to hold the result. This can easily be a factor of 4 in query time, possibly much more.
If you introduce an index just to achieve quick ordering, you are still paying to maintain that index on every insert and update. And unless it's the primary index, sorted access still involves random seeks, which may actually be slower than creating a temporary table. This is also discussed in the Postgres docs.
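If you do decide the ordering is worth it, the index for the question's example would be added with a standard migration, something along these lines (a sketch using the users.name column from the question):

class AddIndexOnUsersName < ActiveRecord::Migration
  def change
    add_index :users, :name  # lets ORDER BY users.name use the index instead of a sort step
  end
end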
In a nutshell, NEVER add an order clause to an SQL query that doesn't need it (unless you enjoy waiting for your database).
NB: I doubt a simple find() will have an ORDER BY attached, because it must return exactly one result. You can verify this very quickly by starting the Rails console, issuing a find, and watching the generated SQL scroll by. However, the where and all calls definitely will be ordered and consequently slower than necessary.
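If you keep the default_scope, individual queries can still opt out of the ordering with standard ActiveRecord methods (a sketch against the question's User model):

User.unscoped.where('created_at = ?', 2.weeks.ago)             # drops the default scope entirely
User.where('created_at = ?', 2.weeks.ago).reorder('users.id')  # replaces the ORDER BY clause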
I've been going round in circles for a few days trying to solve a problem I've also struggled with in the past. Essentially it's a question of understanding the best (or an efficient) way to perform multiple queries on a model, as I regularly find my pages are very slow to load.
Consider the situation where you have a model called Everything. Initially you perform a query which finds those records in Everything which match certain criteria:
@chosenrecords = Everything.where('name LIKE ?', 'What I want').order('price ASC')
I want to remember the contents of @chosenrecords, as I will present them to the user as a list; however, I would also like to know more about the attributes of @chosenrecords, for instance:
@minimumprice = @chosenrecords.first
@numberofrecords = @chosenrecords.count
When I use the above code in my controller and inspect the command history on the local server, I am surprised to find that each of the three queries involves an SQL query on the original Everything model, rather than remembering the records returned in @chosenrecords and querying those. This seems very inefficient to me, and indeed each of the three queries takes the same amount of time to process, making the page slow.
I am more experienced writing code in software like MATLAB, where once you've calculated the value of a variable it is stored locally and can be quickly interrogated, rather than being recalculated every time you want to know more about it. Could you guide me as to whether I am on the wrong track completely and the issues I've identified are just "how it is in Rails", or whether there is something I can do to improve it? I've looked into concepts like using a scope, defining a different variable type, and caching, but I'm not quite sure what I'm doing in each case and keep ending up in a similar hole.
Thanks for your time
You are partially on the wrong track. Rails 3 comes with Arel, which defers the query until the data is required. In your case you have built an Arel query but are executing it once with .first and again with .count. What I have done below is run the query once, load all the results into an array, and work on that array in the next two lines.
Perform the queries like this:
@chosenrecords = Everything.where('name LIKE ?', 'What I want').order('price ASC').all
@minimumprice = @chosenrecords.first
@numberofrecords = @chosenrecords.size
It will solve your issue.
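Note that on Rails 4 and later, .all returns a lazy ActiveRecord::Relation rather than an array, so the explicit materialisation would use .to_a instead:

@chosenrecords = Everything.where('name LIKE ?', 'What I want').order('price ASC').to_a
@minimumprice = @chosenrecords.first
@numberofrecords = @chosenrecords.size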
I have been working in Rails (I mean seriously working) for the last 1.5 years now. Coming from a .NET background and database/OLAP development, there are many things I like about Rails, but there are a few things about it that just don't make sense to me. I need some clarification on one such issue.
I have been working on an educational institute's admission process, which is just a small part of a much bigger application. For the administrator, we need to display a list of all applied/enrolled students (which may range from 1,000 to 10,000), and also give a way to export them as an Excel file. For now, we are just focusing on exporting in CSV format.
My questions are:
Is Rails meant to display so many records at the same time?
Is will_paginate the only way to paginate records in Rails? From what I understand, it still fetches all the records from the DB and then selectively displays the relevant ones. Back in .NET/PHP/JSP, we used to create a stored procedure that selectively returned only the relevant records. Since using stored procedures is a known pain point in Rails, what other options do we have?
The same issue applies to exporting this data. I benchmarked the process, i.e. receiving the request at the server, executing the query, and returning the response. The ActiveRecord object creation was taking a helluva time. Why is that? There were only about 1,000 records, and the page showed a connection timeout to the user. I mean, if the connection times out while working on 1,000 records, why use Rails, or does it mean Rails is not meant for such applications? I have previously worked with TBs of data and never had this issue.
I have never understood ORM techniques at their core. Say we have a users table that is associated with multiple other tables, but for displaying records we only need data from users and its associated admissions table. Does it actually create objects for all the associated tables? I know the data will only be fetched if we use the association, but does it create all the objects beforehand?
I hope, these questions are not independent and do qualify as per the guidelines of SF.
Thank you.
EDIT: Any help? I re-checked and benchmarked again: for 1,000 records, where we join 4-5 different tables (1,000 users, 2-3 one-to-one associations, and 2-3 one-to-many associations), it creates more than 15,000 objects. That is with eager loading; as for lazy loading, it would be the 1,000 user query plus some 20+ queries. What are the other possible options for such problems and applications? I know I am kinda bumping the question to come to the top again!
Rails can handle databases with TBs of data.
Is will_paginate only way to paginate records in Rails?
There are many other gems, like kaminari.
it fetches all records from the db..
NO. It doesn't work that way. For example, take the following query: User.all.page(1).per(10)
User.all won't fire a DB query; it will return a proxy object. You then call page(1) and per(10) on the proxy (an ActiveRecord::Relation). When you try to access the data from the proxy object, it will execute a DB query. Active Record accumulates all the conditions and parameters you pass and executes an SQL query only when required.
Go to the Rails console and type u = User.all; "f" (the second statement, "f", is there to prevent the console from calling to_s on the proxy and displaying the result).
It won't fire any query. Now try u[0]; that will fire a query.
ActiveRecord creation was taking a helluva time
1000 records shouldn't take much time.
Check the number of SQL queries fired at the DB. Look for signs of the N+1 problem and fix them by eager loading.
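As a sketch of that eager loading (the users/admissions tables come from the question; the status column and batch size are assumptions):

# Two queries in total (users + admissions) instead of 1 + N,
# processed in batches so thousands of records are never instantiated at once.
User.includes(:admissions).where(status: 'applied').find_each(batch_size: 500) do |user|
  # build the CSV row for this user here
end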
Check the serialization of the records to CSV format for any CPU- or memory-intensive operations.
Use a profiler and track down the function that is consuming most of the time.
I have 7000 objects in my Db4o database.
When I retrieve all of the objects it's almost instant.
When I add a where constraint, i.e. Name = "Chris", it takes 6-8 seconds.
What's going on?
Also, I've seen a couple of comments about using Lucene for search-type queries. Does anyone have any good links for this?
There are two things to check.
Have you added the 'Db4objects.Db4o.NativeQueries' assembly? Without this assembly, a native query cannot be optimized.
Have you set an index on the field which represents the Name? An index should make the query a lot faster.
Index:
cfg.Common.ObjectClass(typeof(YourObject)).ObjectField("fieldName").Indexed(true);
This question is kinda old, but perhaps this is of any use:
When using native queries, try to set a breakpoint on the lambda expression. If the breakpoint is actually invoked, you're in trouble because the optimization failed. To invoke the lambda, each of the objects will have to be instantiated which is very costly.
If optimization worked, the lambda expression tree will be analyzed and the actual code won't be needed, thus breakpoints won't be triggered.
Also note that setting indexes on fields must be done before opening the connection.
Last, I have a test case of simple objects. When I started without query optimization and indexing (and worse, using a server that was forced to use the GenericReflector because I failed to provide the model .dlls), it took 600 s for a three-criteria query on about 100,000 objects. Now it takes 6 s for the same query on 2.5M objects, so there really is a HUGE gain.