How to drop DB indices before bulk loading in seeds.rb? - ruby-on-rails

In my Rails app I have a seeds.rb script which inserts lots of records. In fact I'm trying to load 16 million of them, and it's taking a long time.
One thing I wanted to try to speed this up is to drop the table indices and re-add them afterwards. If it sounds like I'm doing something insane, please let me know, but that seems to be one recommendation for bulk loading into Postgres.
I use add_index and remove_index commands in migrations, but the same syntax doesn't work in a seeds.rb file. Is it possible to do this outside a migration in fact? (I'm imagining it might not be best practice, because it represents a schema change)
rails v2.3.8,
postgres v8.4.8

One possibility is just to indulge in a little raw SQL within seeds.rb
ActiveRecord::Base.connection.execute("DROP INDEX myindex ON mytable")
At 16 million records, I would recommend managing the whole thing via raw SQL (contained within seeds.rb if you like). Do all 16 million records go into a single table? PostgreSQL's COPY command is the usual bulk-import magic for loading a file (CSV or Postgres's own text format) straight into a table.
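For illustration, here is a minimal sketch of the drop / COPY / re-create sequence in seeds.rb. The table, column, index and file names are all invented, and COPY ... FROM reads a file on the database server (for a client-side file you would use psql's \copy instead). Note that the connection object also responds to add_index and remove_index, so raw DDL is not strictly required.

# seeds.rb -- a sketch only; names are placeholders.
conn = ActiveRecord::Base.connection

# Drop the index first so the bulk load doesn't have to maintain it.
conn.execute("DROP INDEX index_mytable_on_col_a")

# Bulk-load a server-side CSV file straight into the table.
conn.execute(<<-SQL)
  COPY mytable (col_a, col_b)
  FROM '/tmp/mytable.csv'
  WITH CSV
SQL

# Re-create the index once all rows are in.
conn.execute("CREATE INDEX index_mytable_on_col_a ON mytable (col_a)")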

Related

Rails Postgres. Remove indexes before loading data, then re-add them... tidily

We have an importing process which involves loading data from XML files into the database tables of our rails app. We remove the indexes from our tables and then re-add them when loading is complete. This seems to be the recommendation (and seems to be necessary for us) to speed up the loading process. But what is a good tidy way of doing this, which avoids evil duplication? Our current approach feels... kind of clever, but at the same time horribly hacky:
We have 6 different database tables which have total of 42 different indexes defined on them. These are all defined in the Rails migrations and schema.rb, and we want to be able to make changes to these, avoiding duplicating schema definitions elsewhere so...
Current approach:
Before entering the loading logic we have this little bit of black magic:
indexdefs = []
import_tables.each do |table_name|
  res = @conn.exec("SELECT indexname, indexdef FROM pg_indexes WHERE tablename='#{table_name}'")
  res.each do |row|
    indexdefs << row['indexdef']
    @conn.exec("DROP INDEX #{row['indexname']}")
  end
end
logger.info "#{indexdefs.size} indexes dropped"
Then we load data into the tables (takes a long time), and then...
indexdefs.each do |indexdef|
  logger.info "Re-adding index: #{indexdef}"
  @conn.exec(indexdef)
end
As mentioned above, the key thing we achieve with this is no explicit duplication of any knowledge about index definitions/schemas. We query pg_indexes (a Postgres system view) to get the index definitions as they were set up by the migrations, and then we store an array of strings, indexdefs, of SQL CREATE statements which we run later.
...so good and yet so bad.
What's a better way of doing this? Maybe there are frameworks/gems I should be using, or completely different approaches to the problem.
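For reference, one tidier shape for the same pg_indexes trick is to wrap it in a block helper so the bulk load simply runs inside it. This is only a sketch: the method name without_indexes is invented, and it goes through the ActiveRecord connection rather than a raw PG connection.

def without_indexes(table_names)
  indexdefs = []
  conn = ActiveRecord::Base.connection

  table_names.each do |table_name|
    rows = conn.select_all("SELECT indexname, indexdef FROM pg_indexes WHERE tablename = '#{table_name}'")
    rows.each do |row|
      indexdefs << row['indexdef']
      conn.execute("DROP INDEX #{row['indexname']}")
    end
  end

  yield  # run the slow bulk load with no indexes in place
ensure
  # Re-create every index exactly as the migrations defined it.
  indexdefs.each { |indexdef| conn.execute(indexdef) } if indexdefs
end

# Usage (import_all_the_xml_files stands in for your loading code):
# without_indexes(%w[table_a table_b]) { import_all_the_xml_files }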

Strategies to speed up access to databases when working with columns containing massive amounts of data (spatial columns, etc)

First things first, I am an amateur, self-taught ruby programmer who came of age as a novice engineer in the age of super-fast computers where program efficiency was not an issue in the early stages of my primary GIS software development project. This technical debt is starting to tax my project and I want to speed up access to this lumbering GIS database.
It's a PostgreSQL database with the PostGIS extension, controlled from Rails, which immediately creates efficiency issues via the object-ification of database columns when accessing and/or manipulating records with one or many columns containing text or spatial data easily in excess of 1 megabyte per column.
It's extremely slow now, and it didn't use to be like this.
One strategy: I'm considering building child tables of my large spatial data tables (state, county, census tract, etc.) so that when I access the tables I don't have to load the massive spatial columns every time I access the objects. But then doing spatial queries might be difficult on a parent table's children. I'm not sure exactly how I would do that, but I think it's possible.
Maybe I have too many indexes. I have a lot of spatial indexes. Do additional spatial indexes from tables I'm not currently using slow down my queries? How about having too many for one table?
These tables have a massive amount of columns. Maybe I should remove some columns, or create parent tables for the columns with massive serialized hashes?
There are A LOT of tables I don't use anymore. Is there a reason other than tidiness to remove these unused tables? Are they slowing down my queries? Simply doing a #count method on some of these tables takes TIME.
PS:
- Looking back at this 8 hours later, I think what I'm equally trying to understand is how many of the above techniques are completely USELESS when it comes to optimizing (Rails) database performance.
You don't have to read all of the columns of the table. Just read the ones you need.
You can:
MyObject.select(:id, :col1, :col2).where(...)
... and the omitted columns are not read.
If you try to use a method that needs one of the columns you've omitted then you'll get an ActiveModel::MissingAttributeError (Rails 4), but you presumably know when you're going to need them or not.
The inclusion of large data sets in the table is going to be a noticeable problem from the database side if you have full table scans, and then you might consider moving these data to other tables.
If you only use Rails to read and write the large data columns, and don't use PostgreSQL functions on them, you might be able to compress the data on write and decompress it on read. Override the setter and getter methods using write_attribute and read_attribute, compressing on write and decompressing on read.
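A minimal sketch of that override, assuming a model Region with one large column geometry_blob (both names are invented); in practice the column should be a binary/bytea column, since the compressed value is not valid text:

require 'zlib'

class Region < ActiveRecord::Base
  # Compress on write...
  def geometry_blob=(value)
    write_attribute(:geometry_blob, value && Zlib::Deflate.deflate(value))
  end

  # ...and decompress on read.
  def geometry_blob
    raw = read_attribute(:geometry_blob)
    raw && Zlib::Inflate.inflate(raw)
  end
end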
Indexing: if you are using Postgres to store such large chunks of data in single fields, consider storing them as array, JSON, or hstore columns, and index them with the GIN index type so you can search effectively within a given field.
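For example (a sketch assuming Rails 4 and an hstore column named properties on a places table, both invented), the index can be declared in a migration and is then used by containment queries:

class AddGinIndexToPlacesProperties < ActiveRecord::Migration
  def change
    # GIN indexes support the hstore/jsonb containment operators.
    add_index :places, :properties, using: :gin
  end
end

# A GIN index lets containment queries like this one use the index:
# Place.where("properties @> hstore('kind', 'school')")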

How to persist large amounts of data by reading from a CSV file

How to persist large amounts of data by reading from a CSV file (say 20 million rows).
This has been running for close to a day and a half so far and has persisted only 10 million rows. How can I batch this so that it becomes faster, and is there a way to run it in parallel?
I am using the code here to read the CSV; I would like to know if there is a better way to achieve this.
Refer: dealing with large CSV files (20G) in ruby
You can try to first split the file into several smaller files; then you will be able to process them in parallel.
For splitting the file it will probably be faster to use a tool like split:
split -l 1000000 ./test.txt ./out-files-
Then, while you are processing each of the files, instead of inserting records one by one you can combine them into batches and do bulk inserts. Something like:
INSERT INTO some_table
VALUES
(1,'data1'),
(2, 'data2')
For better performance you'll need to build the SQL statement yourself and execute it:
ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')
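A sketch of what that can look like in Ruby, assuming a two-column table some_table(id, data) and one of the split files produced above (out-files-aa); values are quoted through the connection so the generated SQL stays valid:

require 'csv'

conn = ActiveRecord::Base.connection

CSV.open('out-files-aa') do |csv|
  csv.each_slice(1_000) do |rows|
    # Build one multi-row VALUES list per batch of 1,000 rows.
    values = rows.map do |(id, data)|
      "(#{conn.quote(id)}, #{conn.quote(data)})"
    end.join(', ')

    conn.execute("INSERT INTO some_table (id, data) VALUES #{values}")
  end
end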
Since you would like to persist your data to MySQL for further processing, using LOAD DATA INFILE from MySQL would be faster. Something like the following, adapted to your schema:
sql = "LOAD DATA LOCAL INFILE 'big_data.csv'
INTO TABLE tests
FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
(foo,foo1)"
con = ActiveRecord::Base.connection
con.execute(sql)
Key points:
If you use the MySQL InnoDB engine, my advice is to always define an auto-increment PRIMARY KEY; InnoDB uses a clustered index to store data in the table, and a clustered index determines the physical order of data in a table.
refer: http://www.ovaistariq.net/521/understanding-innodb-clustered-indexes/
Configure your MySQL server parameters; the most important ones are listed below (a sketch of setting some of them from Rails follows this list):
(1) disable the MySQL binlog
(2) innodb_buffer_pool_size
(3) innodb_flush_log_at_trx_commit
(4) bulk_insert_buffer_size
You can read this: http://www.percona.com/blog/2013/09/20/innodb-performance-optimization-basics-updated/
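Some of these can be adjusted at runtime from Rails just before the load. This is a sketch only: the values are illustrative, SET SESSION sql_log_bin and the GLOBAL change need elevated privileges, and innodb_buffer_pool_size normally has to be changed in my.cnf with a server restart rather than at runtime.

conn = ActiveRecord::Base.connection

conn.execute("SET SESSION sql_log_bin = 0")                      # (1) skip the binlog for this session
conn.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 2")    # (3) relax log flushing during the load
conn.execute("SET SESSION bulk_insert_buffer_size = 268435456")  # (4) 256 MB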
You should use a producer-consumer pattern to read and insert in parallel; a minimal sketch follows.
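A minimal producer-consumer sketch using Ruby's thread-safe SizedQueue: one producer reads the CSV while a few consumer threads execute batched INSERTs. Table, file and column names are placeholders, and each thread checks out its own connection from the pool.

require 'csv'
require 'thread'

queue   = SizedQueue.new(10)   # at most 10 pending batches in memory
workers = 4.times.map do
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do |conn|
      while (rows = queue.pop)   # a nil batch means "no more work"
        values = rows.map { |(id, data)| "(#{conn.quote(id)}, #{conn.quote(data)})" }.join(', ')
        conn.execute("INSERT INTO some_table (id, data) VALUES #{values}")
      end
    end
  end
end

# Producer: read the CSV and hand off batches of 1,000 rows.
CSV.open('big_data.csv') do |csv|
  csv.each_slice(1_000) { |rows| queue.push(rows) }
end

workers.size.times { queue.push(nil) }  # tell each consumer to stop
workers.each(&:join)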

Importing huge excel file to Rails application

I have an Excel file with thousands of rows. In my case, I can't use bulk insert, because for each row I need to create a few associations. Right now the whole process takes more than 1 hour for 20k rows, which is hell. What is the best way to resolve this problem?
I'm using spreadsheet gem.
This is analogous to the infamous "N+1" query situation that Rails loves to encounter. I have a similar situation (importing files of 20k+ rows with multiple associations). The way I optimized this process was to pre-load hashes for the associations. So for example, if you have an AssociatedModel that contains a lookup_column that is in your import data, you would first build a hash:
# Missing lookups return the sentinel :not_found instead of nil.
associated_model_hash = Hash.new(:not_found)
AssociatedModel.find_each do |item|
  associated_model_hash[item.lookup_column] = item
end
This provides a hash of objects. You can repeat for as many associations as you have. In your import loop:
associated_model = associated_model_hash[row[:lookup_column]]
new_item.associated_model_id = associated_model.id
Because you don't have to do a search on the database each time, this is much faster. It should also allow you to use bulk insert (assuming you can guarantee that the associated models will not be deleted or modified in a bad way during the load).

Ruby on Rails migrations are very slow

I have an SQLite3 database which is populated with a large set of data.
I use a migration for that.
3 tables will have the following counts of records:
Table_1 will have about 10 records
each record of Table_1 will be associated with ~100 records in Table_2
each record of Table_2 will be associated with ~2000 records in Table_3
The count of records will be about 10*100*2000 = 2000000
This takes a long time... Even if I populate my database with about 20,000 records, it takes about 10 minutes.
Also, I have noticed that during migration execution the Ruby interpreter takes just 5% of CPU time and 95% remains unused...
What is the reason for such poor performance?
Quite simply, inserting large amounts of records through manually saving AR objects one at a time is going to take years.
The best compromise between speed and "cleanness" (i.e. not a completely dodgy hack) for inserting large amounts of data is ar-extensions' (http://github.com/zdennis/ar-extensions) import method. It's not ideal, but it's better than any of the alternatives I could find, and the syntax is clean and doesn't require you to drop to raw SQL (or anywhere close).
Example syntax:
items = Array.new
1.upto(200) do |n|
  items << Item.new :some_field => n
end
Item.import items, :validate => false
At least in MySQL this will batch the records into a single INSERT statement with multiple sets of values. Pretty damn fast.
If you run each INSERT statement in its own transaction, SQLite can be very, very slow. But if you run it all in one transaction (or a logical set of transactions), then it can be very fast.
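A sketch, reusing Item and :some_field from the example above: wrapping the whole loop in ActiveRecord::Base.transaction means SQLite issues a single commit instead of one per row.

# One transaction around the whole load instead of one per INSERT.
ActiveRecord::Base.transaction do
  1.upto(200_000) do |n|
    Item.create!(:some_field => n)
  end
end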
Seed_fu could help, as discussed in this question
