My program deals with a deeply nested object.
Here is an illustration of the nested model:
ParentObject HasMany ChildObject1 ~ 30 records
ChildObject1 HasMany ChildObject2 ~ 40 records
ChildObject2 HasMany ChildObject3 ~ 15 records
ChildObject2 HasMany ChildObject4 ~ 10 records
To keep the app efficient, I decided to split the forms used to record this data (one form per ChildObject1). I also use caching, so I need to update ChildObject1's 'updated_at' field every time a ChildObject2, 3 or 4 is updated. For this reason, every child object's 'belongs_to' association has the 'touch' option set to true.
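For reference, a minimal sketch of how such associations might be declared (class and association names here are illustrative, not taken from the actual code):

class ChildObject2 < ActiveRecord::Base
  belongs_to :child_object1, touch: true   # bumps child_object1.updated_at on every save
  has_many :child_object3s
  has_many :child_object4s
end

class ChildObject3 < ActiveRecord::Base
  belongs_to :child_object2, touch: true   # touching cascades up through every belongs_to marked touch: true
end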
With this setup, on a small server, performance is not so bad (at most 1 second to save the data).
But once everything is recorded, I also need to duplicate the ParentObject with all its child objects.
Duplicating it and building an identical ParentObject is no problem, but when I save the copy, the transaction takes a very long time.
I looked at the server log and saw that the objects are inserted one by one. I also saw that after each insert, the parent's 'updated_at' field is updated (because of the 'touch: true' option).
That comes to roughly 30,000 inserts plus 60,000 updates, i.e. 90,000 write queries against the database (and each object can have 3 to 6 fields...)!
Normally, the 'save' method natively wraps its writes in ActiveRecord::Base.transaction.
That doesn't seem to be happening here.
I tried removing the 'touch: true' option; the result is exactly the same, the inserts are still done one by one.
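For illustration, wrapping the duplication in an explicit transaction looks roughly like this (duplicate_with_children is a hypothetical placeholder for whatever builds the copy in memory); it groups the writes into one commit but does not reduce the number of INSERT statements:

# Sketch only: duplicate_with_children is a hypothetical placeholder for the
# code that builds the new ParentObject and all of its nested children in memory.
copy = duplicate_with_children(parent_object)

ActiveRecord::Base.transaction do
  copy.save!   # one commit, but still one INSERT per child record
end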
So my questions are:
I thought transactions could be applied to nested objects as explained here. Am I misunderstanding something?
Is this an example of what shouldn't be done through ActiveRecord?
Is it possible to do only one final update of the parent objects with the 'touch: true' option? (SOLVED: SEE ANSWER BELOW)
Normally, is writing 90,000 rows to the database in one go really that much work? Or are the Puma server and the PostgreSQL database simply badly configured?
Thanks in advance for your help. If there's no solution, I will run this job automatically overnight...
I solved the first part of the problem with https://github.com/godaddy/activerecord-delay_touching
This gem delays the 'touch' updates until the end of the batch. It's much cleaner this way!
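If I remember correctly, the gem is used by wrapping the batch in a block, roughly like this (treat it as a sketch; the collection name is illustrative):

# With activerecord-delay_touching, touches performed inside the block are
# deduplicated and applied once at the end of the block.
ActiveRecord::Base.delay_touching do
  child_object2s.each(&:save!)
end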
But I still have a problem with the transactions: I still don't know whether I can insert all the data with one single query per table.
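For anyone else wondering about that last point: a single multi-row INSERT per table is possible with the activerecord-import gem (or insert_all on Rails 6+). A rough sketch, assuming the child records have already been built in memory with their foreign keys set:

# new_children is assumed to be an array of unsaved ChildObject2 instances
# whose foreign keys are already set; validations are skipped for speed.
ChildObject2.import(new_children, validate: false)   # one multi-row INSERT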
Related
I'm trying to use the closure tree gem for modelling some (ordered) nested data.
The issue I'm having is that when I go to insert records into the database (MySQL) it takes about 7 seconds to insert the 200 children (well, 400 inserts).
I'm about to go down the route of doing a straight bulk insert / raw SQL in order to speed things up, though this means making sure I get the hierarchy calls etc. correct.
If anyone has a strategy out there for doing bulk inserts of children with closure_tree I'd love to see it.
My call to closure_tree is: has_closure_tree order: 'position'
I have also tried setting ActiveRecord::Base.connection.execute "set autocommit = 0;" (makes no difference) and turning off advisory_lock (also makes no difference)
[edit] also tried wrapping in a transaction where I was adding the children, no joy either.
[edit] have opened an issue (which I hate doing, but I'm hoping there's a strategy I can follow for this)
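One strategy worth trying (a sketch, not something verified against this exact setup): bulk-insert the rows with parent_id and position set directly, which skips closure_tree's callbacks, then let closure_tree regenerate its hierarchy table afterwards. The model name and the activerecord-import gem are assumptions here.

# rows is assumed to be an array of [name, parent_id, position] values built
# in memory; Category stands in for the actual closure_tree model.
Category.import([:name, :parent_id, :position], rows, validate: false)

# closure_tree can regenerate its hierarchy (closure) table from the
# parent_id column after a bulk load.
Category.rebuild!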
My app needs to import hundreds to thousands of records at a time. Each record is a node in a tree structure. I'm using activerecord-import to significantly speed up the import, and haven't yet settled on whether to use ancestry, closure_tree, acts_as_list or a custom solution for setting out the hierarchy.
The problem I'm grappling with is how to import all the data and relationships in one or just a few passes. My naive draft solution is:
creating each object in memory, and manually giving each object an id;
using those ids to manually give each object the foreign key that it needs (e.g. parent_id); and then
mass-importing the resulting array of objects using activerecord-import
This feels like a hack with obvious problems. For example, if the ids that I've chosen for my objects get used by the database while I'm still instantiating my objects, then the relationships I've manually set become useless/wrong.
Another major problem is that as I look into more advanced solutions for the tree data structure (eg closure_tree and ancestry), manually setting the fields required by those gems feels more and more like a hack.
So I guess my question is: is there a clean way to set up a tree structure of N nodes in a Rails ActiveRecord database while touching the database fewer than N times?
The master branch of activerecord-import has a commit with this functionality. It only works with Rails 4 and Postgres.
If you happen to have another configuration, you will need to:
Create a new array of models/hashes
Model.import it
Retrieve the IDs of the rows you just inserted
Go back to step 1, setting the IDs you just retrieved as the parent_id values in the new array (see the sketch below)
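A rough sketch of that loop, assuming Postgres (so Model.import returns the new primary keys) and a nodes_by_depth hash you have built yourself, mapping depth to the unsaved nodes at that depth:

# nodes_by_depth is assumed to map depth => array of unsaved Node instances;
# each node responds to #parent, pointing at its in-memory parent object.
nodes_by_depth.keys.sort.each do |depth|
  batch = nodes_by_depth[depth]
  batch.each { |node| node.parent_id = node.parent.id if node.parent }  # ids were set on the previous pass

  result = Node.import(batch, validate: false)               # one multi-row INSERT per tree level
  batch.zip(result.ids).each { |node, id| node.id = id }     # on Postgres, import returns the new ids
end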
I have a database with a few million entities that need a friendly_id slug. Is there a way to speed up the process of saving the entities? find_each(&:save) is very slow: at 6-10 records per second I'm looking at over a week of running this 24/7.
I'm just wondering if there is a method within friendly_id, or a parallel processing trick, that can speed this up drastically.
Currently I'm running about 10 consoles, each one starting 100k further into the table:
Model.where(slug: nil).find_each(start: value) do |e|
puts e.id
e.save
end
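For reference, the same manual sharding can be scripted with the parallel gem instead of 10 consoles (a sketch: the id ranges and worker count are made up, and each forked worker needs its own database connection):

require 'parallel'

# Split the id space into chunks; adjust the ranges to the actual id spread.
ranges = (0...10).map { |i| (i * 500_000)...((i + 1) * 500_000) }

Parallel.each(ranges, in_processes: 10) do |range|
  ActiveRecord::Base.connection.reconnect!   # forked workers must not share the parent's DB connection
  Model.where(slug: nil, id: range).find_each(&:save)
end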
EDIT
Well, one of the biggest things causing the updates to be so insanely slow was the initial find query for the entity, not the actual saving of the record. I put the site live the other day and, looking at server database requests continually hitting 2000 ms, the culprit was #entity = Entity.find(params[:id]) causing the most problems with 5+ million records. I hadn't realized there was no index on the slug column, and Active Record was running its SELECT statements against that column. After indexing it properly I get 20 ms response times, and the loop above went from 1-2 entities per second to about 1,000 per second. Running several of them in parallel got the one-time job done quickly enough.
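For anyone hitting the same wall, the fix is just an index on the slug column; as a migration it looks roughly like this (table name assumed; on Rails 5+ the migration class also needs a version tag):

class AddIndexToEntitiesSlug < ActiveRecord::Migration
  def change
    # friendly_id looks records up by slug, so the column needs an index
    add_index :entities, :slug, unique: true
  end
end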
I think the fastest way to do this would be to go straight to the database, rather than using Active Record. If you have a GUI like Sequel Pro, connect to your database (the details are in your database.yml) and execute a query there. If you're comfortable on the command line you can run it straight in the database console window. Ruby and Active Record will just slow you down for something like this.
To update all the slugs of a hypothetical table called "users", where the slug is a concatenation of first name and last name, you could do something like this in MySQL:
UPDATE users SET slug = CONCAT(first_name, "-", last_name) WHERE slug IS NULL
I have been working in Rails (I mean serious work) for the last 1.5 years. Coming from a .NET background and database/OLAP development, there are many things I like about Rails, but a few things just don't make sense to me. I need some clarification on one such issue.
I have been working on an educational institute's admission process, which is just a small part of a much bigger application. For the administrator, we need to display a list of all applied/enrolled students (which may range from 1,000 to 10,000), and also provide a way to export them as an Excel file. For now, we are just focusing on exporting CSV.
My questions are:
Is Rails meant to display so many records at the same time?
Is will_paginate the only way to paginate records in Rails? From what I understand, it still fetches all the records from the DB and then selectively displays the relevant ones. Back in .NET/PHP/JSP, we used to write a stored procedure that selectively returned only the relevant records. Since using stored procedures is a known pain point in Rails, what other options do we have?
Same issue with exporting this data. I benchmarked the process, i.e. receiving the request at the server, executing the query and returning the response. The ActiveRecord object creation was taking a huge amount of time. Why was that? There were only about 1,000 records, and the page showed a connection timeout to the user. I mean, if the connection times out while working on 1,000 records, why use Rails at all, or does it mean Rails is not meant for such applications? I have previously worked with TBs of data and never had this issue.
I never understood ORM techniques at their core. Say we have a users table associated with multiple other tables, but for displaying records we only need data from users and its associated admissions table: does it actually create objects for all the associated tables? I know the data will only be fetched if we use the association, but does it create all those objects beforehand?
I hope, these questions are not independent and do qualify as per the guidelines of SF.
Thank you.
EDIT: Any help? I re-checked and benchmarked again: for 1,000 records, where we join 4-5 different tables (1,000 users, 2-3 one-to-one associations, and 2-3 one-to-many associations), more than 15,000 objects are created. That is with eager loading; with lazy loading it would be the 1,000-user query plus some 20+ queries. What other options are there for such problems and applications? I know I'm kind of bumping the question back to the top!
Rails can handle databases with TBs of data.
Is will_paginate only way to paginate records in Rails?
There are many other gems like "kaminari".
it fetches all records from the db..
No, it doesn't work that way. For example, take the following query: User.all.page(1).per(10)
User.all won't fire a DB query; it returns a proxy object (an ActiveRecord::Relation), and you call page(1) and per(10) on that proxy. Only when you try to access the data from the proxy does it execute a DB query. Active Record accumulates all the conditions and parameters you pass and executes a SQL query only when required.
Go to the Rails console and type u = User.all; "f" (the second statement, "f", is there to prevent the console from inspecting the proxy and thereby triggering the query just to display the result).
That won't fire any query. Now try u[0]; that will fire a query.
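In other words, with a pagination gem such as Kaminari the LIMIT/OFFSET ends up in the SQL itself, so only one page of rows ever leaves the database. For example (model name assumed):

users = User.order(:id).page(2).per(10)   # still just a relation, no query fired yet
users.to_a                                # fires something like: SELECT "users".* FROM "users" ... LIMIT 10 OFFSET 10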
ActiveRecord creation was taking a helluva time
1000 records shouldn't take much time.
Check the number of SQL queries fired against the DB. Look for signs of the N+1 problem and fix them by eager loading (see the sketch after these suggestions).
Check the serialization of the records to CSV format for any CPU- or memory-intensive operations.
Use a profiler and track down the function that is consuming most of the time.
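As an illustration of the eager-loading suggestion (model and association names assumed from the question):

# N+1: one query for users, then one admissions query per user
User.limit(1000).each { |user| user.admission }

# Eager loaded: two queries in total, no matter how many users there are
User.includes(:admission).limit(1000).each { |user| user.admission }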
There is some code in the project I'm working on where a dynamic finder behaves differently in one code branch than it does in another.
This line of code returns all my advertisers (there are 8 of them), regardless of which branch I'm in.
Advertiser.findAllByOwner(ownerInstance)
But when I start adding conditions, things get weird. In branch A, the following code returns all of my advertisers:
Advertiser.findAllByOwner(ownerInstance, [max: 25])
But in branch B, that code only returns 1 advertiser.
It doesn't seem possible that changes in application code could affect how a dynamic finder works. Am I wrong? Is there anything else that might cause this not to work?
Edit
I've been asked to post my class definitions. Instead of posting all of it, I'm going to post what I think is the important part:
static mapping = {
owner fetch: 'join'
category fetch: 'join'
subcategory fetch: 'join'
}
static fetchMode = [
grades: 'eager',
advertiserKeywords: 'eager',
advertiserConnections: 'eager'
]
This code was present in branch B but absent from branch A. When I pull it out, things now work as expected.
I decided to do some more digging with this code present to see what I could observe. I found something interesting when I used withCriteria instead of the dynamic finder:
Advertiser.withCriteria{owner{idEq(ownerInstance.id)}}
What I found was that this returned thousands of duplicates! So I tried using listDistinct:
Advertiser.createCriteria().listDistinct{owner{idEq(ownerInstance.id)}}
Now this returns all 8 of my advertisers with no duplicates. But what if I try to limit the results?
Advertiser.createCriteria().listDistinct{
owner{idEq(ownerInstance.id)}
maxResults 25
}
Now this returns a single result, just like my dynamic finder does. When I cranked maxResults up to 100K, I got all 8 of my results.
So what's happening? It seems that the joins and/or the eager fetching generated SQL that returned thousands of duplicate rows. Grails dynamic finders must return distinct results by default, so as long as I wasn't limiting them I didn't notice anything strange. But once I set a limit, since the rows were ordered by id, the first 25 rows were all duplicates of the same record, meaning only one distinct record was returned.
As for the joins and eager fetching, I don't know what problem that code was trying to solve, so I can't say whether or not it's necessary; the question is, why does having this code in my class generate so many duplicates?
I found out that the eager fetching was added (many levels deep) in order to speed up the rendering of certain reports, because hundreds of queries were being made. Attempts were made to eager fetch on demand, but other developers had difficulty going more than one level deep using finders or Grails criteria.
So the general answer to the question above is: instead of eager fetching by default, which can cause huge nightmares elsewhere, we need a way to do eager fetching on a single query that can go more than one level down the tree.
The next question is, how? It's not very well supported in Grails, but it can be achieved by simply using Hibernate's Criteria class. Here's the gist of it:
def advertiser = Advertiser.createCriteria()
    .add(Restrictions.eq('id', advertiserId))                                  // restrict to one advertiser
    .createCriteria('advertiserConnections', CriteriaSpecification.INNER_JOIN) // eager join one level down
    .setFetchMode('serpEntries', FetchMode.JOIN)                               // eager fetch the next level in the same query
    .uniqueResult()
Now the advertiser's advertiserConnections will be eager fetched, and the advertiserConnections' serpEntries will also be eager fetched. You can go as far down the tree as you need to. Then you can leave your classes lazy by default, which they definitely should be for hasMany scenarios.
Since your query is retrieving duplicates, there's a chance that the 25 records allowed by the limit are all the same data, so your distinct collapses them into a single record.
Try defining equals() and hashCode() for your classes, especially any that have a composite primary key or are used in a hasMany.
I also suggest you try to narrow down the possibilities: remove the fetch joins and the eager settings one by one and see how each affects your result data (without the limit).