Parsing large CSVs while searching Database - ruby-on-rails

We have a tricky problem and are looking for the most efficient way to solve it.
We periodically iterate through large CSV files (roughly 50,000 to 2 million rows), and for each row we need to check a database table for matching columns.
For example, each CSV row could hold details about an event (artist, venue, date/time, etc.), and for each row we check our database (PostgreSQL) for the rows that most closely match the artist, venue and date/time, then perform operations if a match is found.
Going row by row is highly CPU-, memory- and time-intensive, so we currently perform the matching in batches, but we are still looking for a way to do the comparison that is efficient in both memory and time.
Thanks.

Load the complete CSV file into a temporary table in your database (using a DB tool, see e.g. How to import CSV file data into a PostgreSQL table?)
Perform matching and operations in-database, i.e. in SQL
If necessary, truncate the temporary table afterwards
This would move most of the load into the DB server, avoiding all the ActiveRecord overhead (network traffic, result parsing, model instantiation etc.)
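A minimal sketch of the three steps above, assuming hypothetical table names (`events`, `csv_events`) and matching columns (`artist`, `venue`, `starts_at`), none of which come from the question. It only builds the SQL strings; in a Rails app they could be run through `ActiveRecord::Base.connection.execute`, with the COPY step fed by an import tool such as psql's `\copy`. The join uses exact equality, which is a simplification of the "closest match" logic the question describes.

```ruby
# Builds the SQL for: load CSV into a temp table, match in-database, clean up.
# All table and column names are hypothetical defaults; adjust to the real schema.
def staging_statements(staging: 'csv_events', target: 'events',
                       columns: { 'artist' => 'text', 'venue' => 'text',
                                  'starts_at' => 'timestamptz' })
  keys        = columns.keys
  definitions = columns.map { |name, type| "#{name} #{type}" }.join(', ')
  join        = keys.map { |k| "e.#{k} = s.#{k}" }.join(' AND ')

  {
    # Step 1: a temporary table that lives only for the current session.
    create: "CREATE TEMP TABLE #{staging} (#{definitions})",
    # Step 1 (continued): bulk-load the file instead of inserting row by row.
    copy: "COPY #{staging} (#{keys.join(', ')}) FROM STDIN WITH (FORMAT csv, HEADER true)",
    # Step 2: one set-based join replaces one lookup query per CSV row.
    match: "SELECT e.id, s.* FROM #{target} e JOIN #{staging} s ON #{join}",
    # Step 3: empty the staging table if the session is reused.
    truncate: "TRUNCATE #{staging}"
  }
end

staging_statements.each_value { |sql| puts sql }
```

If the real matching is fuzzy rather than exact, only the `match` statement would need to change; the loading and cleanup steps stay the same.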

Related

Syncing data between CSV and DB

I need to sync data between a CSV file and my DB, but the process is very slow when I check whether each item exists.
For example, I have a very large list of postal codes; when this list is loaded into the system, the app needs to check whether each record already exists in the database.
I tried find_or_initialize_by, but it is very slow when the list of postal codes has more than 100_000 records. I also tried caching all the records from the database and comparing them using .select, but that is almost as slow as querying the database.
Any suggestions?
Using find_or_initialize_by is extremely slow for use cases like these because it runs at least one query against the database for each line in the CSV file, and if the record wasn't found, saving it adds a second (insert) query. Even if every single query is fast, say 5 ms, they add up: with 100k lines in the CSV, the lookup queries alone take over 8 minutes.
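The estimate can be checked with a few lines of arithmetic. The 5 ms latency is the assumed figure from the paragraph above, not a measurement.

```ruby
rows            = 100_000 # CSV lines, as in the example above
queries_per_row = 1       # at least one SELECT per line
ms_per_query    = 5       # assumed latency per query

total_seconds = rows * queries_per_row * ms_per_query / 1000.0
puts format('%.1f minutes', total_seconds / 60) # prints "8.3 minutes"
```

Every missing record adds an insert on top of this, so the real total would be higher.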
Therefore my approach would be to avoid doing many small database queries and instead do only a few, big queries and keep the data in memory.
First, load from the database not the whole records but only the parts that identify them. For postal code data that might be the zip_code column. Then store that data in an in-memory data structure that allows fast lookup, for example in a Set.
require 'set'
existing_zip_codes = Set.new(
  PostalCode.pluck(:zip_code)
)
Then iterate over the CSV and collect all data that need to be imported into the database.
require 'csv'
missing_postal_codes = []
CSV.foreach(...) do |row|
  # skip rows whose zip code is already in the database
  next if existing_zip_codes.include?(row['zip_code'])
  missing_postal_codes << {
    zip_code: row['zip_code'],
    city: row['city'],
    # ...
  }
end
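And in the last step, I would insert all the missing records into the database with one big insert_all call (available since Rails 6), skipping the call when there is nothing to insert.
PostalCode.insert_all(missing_postal_codes) unless missing_postal_codes.empty?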
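A self-contained variant of the approach above that runs without a database, as a rough sketch. The in-memory CSV, the sample zip codes and the batch size are made up for illustration. Two things go beyond the original answer and are assumptions: `Set#add?` is used so that duplicates inside the CSV itself are also skipped, and the rows are sliced into batches because `insert_all` builds a single INSERT statement, which may become very large for hundreds of thousands of rows.

```ruby
require 'csv'
require 'set'

# Stand-in for PostalCode.pluck(:zip_code); the values are made up.
existing_zip_codes = Set.new(%w[10001 94105])

# Stand-in for the CSV file on disk.
csv_data = <<~ROWS
  zip_code,city
  10001,New York
  60601,Chicago
  94105,San Francisco
  73301,Austin
  60601,Chicago
ROWS

missing = []
CSV.parse(csv_data, headers: true).each do |row|
  zip = row['zip_code']
  # Set#add? returns nil when the value is already present, so this skips
  # existing records and duplicates within the CSV in one check.
  next unless existing_zip_codes.add?(zip)

  missing << { zip_code: zip, city: row['city'] }
end

batch_size = 1_000 # assumed value; tune for the database in use
missing.each_slice(batch_size) do |batch|
  # In Rails this line would be: PostalCode.insert_all(batch)
  puts "would insert #{batch.size} rows"
end
```

With the sample data only Chicago and Austin are collected, and the second Chicago row is dropped as a duplicate.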

Strategies to speed up access to databases when working with columns containing massive amounts of data (spatial columns, etc)

First things first: I am an amateur, self-taught Ruby programmer who came of age as a novice engineer in the era of super-fast computers, when program efficiency was not an issue in the early stages of my primary GIS software development project. That technical debt is starting to tax my project, and I want to speed up access to this lumbering GIS database.
It's a PostgreSQL database with the PostGIS extension, accessed through Rails, which immediately creates efficiency issues: Rails turns database columns into objects whenever records are read or manipulated, and one or more columns per record contain text or spatial data easily in excess of 1 megabyte per column.
It's extremely slow now, and it didn't use to be like this.
One strategy: I'm considering building child tables for my large spatial data tables (state, county, census tract, etc.) so that I don't have to load the massive spatial columns every time I access the objects. But then running spatial queries against a parent table's children might be difficult. I'm not sure exactly how I would do that, but I think it's possible.
Maybe I have too many indexes. I have a lot of spatial indexes. Do additional spatial indexes from tables I'm not currently using slow down my queries? How about having too many for one table?
These tables have a massive number of columns. Maybe I should remove some columns, or create parent tables for the columns with massive serialized hashes?
There are A LOT of tables I don't use anymore. Is there a reason other than tidiness to remove these unused tables? Are they slowing down my queries? Simply doing a #count method on some of these tables takes TIME.
PS:
- Looking back at this 8 hours later, I think what I'm equally trying to understand is how many of the above techniques are completely USELESS when it comes to optimizing (rails) database performance?
You don't have to read all of the columns of the table. Just read the ones you need.
You can:
MyObject.select(:id, :col1, :col2).where(...)
... and the omitted columns are not read.
If you try to use a method that needs one of the columns you've omitted then you'll get an ActiveModel::MissingAttributeError (Rails 4), but you presumably know when you're going to need them or not.
The inclusion of large data sets in the table is going to be a noticeable problem from the database side if you have full table scans, and then you might consider moving these data to other tables.
If you only use Rails to read and write the large data columns, and don't use PostgreSQL functions on them, you might be able to compress the data on write and decompress on read. Override the getter and setter methods by using write_attribute and read_attribute, compressing and decompressing (respectively of course) the data.
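A rough sketch of the compress-on-write, decompress-on-read idea using Ruby's standard Zlib library. The `Region` class and its `payload` attribute are hypothetical, and a plain hash stands in for the database row so the example runs outside Rails. In a real model the two accessors would call `write_attribute(:payload, ...)` and `read_attribute(:payload)`, and the column would need a binary type, since the compressed bytes are not valid text.

```ruby
require 'zlib'

# Plain-Ruby stand-in for a model; @attributes plays the role of the row.
class Region
  def initialize
    @attributes = {}
  end

  # Setter: compress before the value is stored.
  def payload=(text)
    @attributes[:payload] = Zlib::Deflate.deflate(text)
  end

  # Getter: decompress on read and restore the UTF-8 encoding label,
  # because Zlib returns binary strings.
  def payload
    raw = @attributes[:payload]
    raw && Zlib::Inflate.inflate(raw).force_encoding(Encoding::UTF_8)
  end

  # Size of what would actually be written to the database.
  def stored_bytes
    @attributes[:payload].bytesize
  end
end

region = Region.new
region.payload = 'POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))' * 1_000
puts "stored #{region.stored_bytes} bytes for #{region.payload.bytesize} bytes of text"
```

The repeated sample string compresses unusually well; real spatial or serialized data will shrink less, so it is worth measuring before committing to this.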
Indexing: if you are using Postgres to store such large chunks of data in single fields, consider storing them as array, JSON (jsonb, if you want GIN indexing) or hstore fields. You can then index them with a GIN index so that you can search effectively within a given field.

How to persist large amounts of data by reading from a CSV file

How can I persist large amounts of data read from a CSV file (say 20 million rows)?
This has been running for close to 1 1/2 days so far and has persisted only 10 million rows. How can I batch this so that it becomes faster, and is it possible to run it in parallel?
I am using the code here to read the CSV; I would like to know if there is a better way to achieve this.
Refer: dealing with large CSV files (20G) in ruby
You can try splitting the file into several smaller files first; then you will be able to process several files in parallel.
For splitting the file, it will probably be faster to use a tool like split:
split -l 1000000 ./test.txt ./out-files-
Then, while processing each of the files, and assuming you are currently inserting records one by one, you can instead combine them into batches and do bulk inserts. Something like:
INSERT INTO some_table
VALUES
(1,'data1'),
(2, 'data2')
For better performance you'll need to build the SQL statement yourself and execute it:
ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')
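A minimal sketch of building such a multi-row INSERT by hand. The `quote_literal` helper is a hypothetical stand-in written for this example: it only doubles single quotes and does not handle backslashes, which MySQL treats as escape characters by default. In a Rails app, `ActiveRecord::Base.connection.quote` would be the safer choice for quoting values. The column names are illustrative.

```ruby
# Demo-only quoting; see the caveats above before using anything like this.
def quote_literal(value)
  case value
  when nil     then 'NULL'
  when Numeric then value.to_s
  else "'#{value.to_s.gsub("'", "''")}'"
  end
end

# Builds one multi-row INSERT for a batch of rows (arrays of values).
def bulk_insert_sql(table, columns, rows)
  values = rows.map { |row| "(#{row.map { |v| quote_literal(v) }.join(', ')})" }
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{values.join(', ')}"
end

rows = [[1, 'data1'], [2, "O'Brien"]]
puts bulk_insert_sql('some_table', %w[id name], rows)
# prints: INSERT INTO some_table (id, name) VALUES (1, 'data1'), (2, 'O''Brien')
```

The resulting string is what would be passed to `ActiveRecord::Base.connection.execute`, one call per batch rather than one per row.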
Since you would like to persist your data to MySQL for further processing, using MySQL's LOAD DATA INFILE would be faster. Something like the following, adapted to your schema:
sql = "LOAD DATA LOCAL INFILE 'big_data.csv'
INTO TABLE tests
FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
(foo,foo1)"
con = ActiveRecord::Base.connection
con.execute(sql)
Key points:
If you use the MySQL InnoDB engine, my advice is to always define an auto-increment PRIMARY KEY. InnoDB uses a clustered index to store the table's data, and a clustered index determines the physical order of data in a table.
refer: http://www.ovaistariq.net/521/understanding-innodb-clustered-indexes/
Configure your MySQL server parameters; the most important ones are:
(1) disable the MySQL binary log (binlog)
(2) innodb_buffer_pool_size.
(3) innodb_flush_log_at_trx_commit
(4) bulk_insert_buffer_size
You can read this: http://www.percona.com/blog/2013/09/20/innodb-performance-optimization-basics-updated/
You should use a producer-consumer pattern.
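One possible shape for the producer-consumer suggestion, sketched with Ruby's built-in `SizedQueue` and threads. The generated rows stand in for lines read from the CSV file, and the consumers only count rows where a real worker would run one bulk insert per batch. The queue size, batch size and worker count are arbitrary assumptions.

```ruby
queue      = SizedQueue.new(4) # bounded: the producer blocks when consumers lag
batch_size = 1_000
workers    = 3
total_rows = 10_500
inserted   = Queue.new         # thread-safe collector standing in for the database

producer = Thread.new do
  # Stand-in for reading the file line by line and grouping lines into batches.
  (1..total_rows).each_slice(batch_size) do |ids|
    queue << ids.map { |i| "#{i},data#{i}" }
  end
  workers.times { queue << :done } # one stop signal per consumer
end

consumers = Array.new(workers) do
  Thread.new do
    while (batch = queue.pop) != :done
      # A real consumer would execute one bulk INSERT for the batch here.
      inserted << batch.size
    end
  end
end

[producer, *consumers].each(&:join)

total = 0
total += inserted.pop until inserted.empty?
puts "inserted #{total} rows" # prints "inserted 10500 rows"
```

Because the queue is bounded, memory use stays roughly constant regardless of file size. In MRI, threads help here mainly because time spent waiting on the database is I/O, during which other threads can run.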
Sorry for my poor English.

Are there any extensions to TADOQuery that include client indexes

Quick question (hopefully)
I have a large dataset (>100,000 records) that I would like to use as a lookup to determine existence or non-existence of multiple keys. The purpose of this is to find FK violations before trying to commit them to the database to try and avoid the resultant EDatabaseError messing up my transaction.
I had been using TClientDataSet/TDatasetProvider with the FindKey method, as this allowed a client-side index to be set up and was faster (2s to scan each key rather than 10s for ADO). However, moving to large datasets the population of the CDS is starting to take far more time than the local index is saving.
I see that I have a few options for alternatives:
client cursor with TADOQuery.locate method
ADO SELECT statements for each check (no client cache)
ADO SEEK method
Extend TADOQuery to mimic FindKey
The Locate method seems easiest and doesn't spam the server with the SELECT/SEEK methods. I like the idea of extending the TADOQuery, but was wondering whether anyone knew of any ready-made solutions for this rather than having to create my own?
I would create a temporary table in the database server. Insert all 100,000 records into this temp table. Do bulk inserts of say 3000 records at a time, to minimise round trips to the server. Then run select statements on this temp table to check for foreign key violations etc. If all okay, do an insert SQL from the temp table to the main table.

Checking for updated dimension data

I have an OLTP database, and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, notification settings.
In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form).
There are currently 10,390 records but this figure is expected to grow.
I want to use Type 2 ETL whereby if a record has changed in the OLTP database, a new record is added to the DW.
What is the best way to scan through 10,000 records in the DW and then compare the results with the results in several tables contained in the OLTP?
I'm thinking of creating a "snapshot" using a temporary table of the OLTP data and then comparing the results row by row with the data in the Dimension table in the DW.
I'm using SQL Server 2005. This doesn't seem like the most efficient way. Are there alternatives?
Introduce a LastUpdated column into the source system (OLTP) tables. This way you have less to extract, using:
WHERE LastUpdated >= some_time_here
You seem to be using SQL Server, so you may also try the rowversion type (an 8-byte, database-scoped unique counter).
When importing your data into the DW, use an ETL tool (SSIS, Pentaho, Talend). They all have a component (block, transformation) to handle SCD2 (slowly changing dimension type 2). For SSIS example see here. The transformation does exactly what you are trying to do: all you have to do is specify which columns to monitor and what to do when it detects a change.
It sounds like you are approaching this sort of backwards. The typical way of performing ETL (Extract, Transform, Load) is:
"Extract" the data from your OLTP database
Compare your extracted data against the dimensional data to determine whether there are changes, or perform whatever other validation is needed (the "Transform" step)
Insert the data ("Load") into your dimension table.
Effectively, in step #1, you'll create a physical record via a query against the multiple tables in your OLTP database, then compare that resulting record against your dimensional data to determine if a modification was made. This is the standard way of doing things. In addition, 10,000 rows is pretty insignificant as far as volume goes. Any RDBMS and ETL process should be able to get through that in no more than a few seconds. I know SQL Server has DTS, which was replaced by SSIS (SQL Server Integration Services) in SQL Server 2005. That is the perfect tool for doing something like this.
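To make the compare-then-load step concrete, here is a small in-memory sketch of a Type 2 update. The column names (`student_id`, `email`, `city`) and the `valid_from` / `valid_to` versioning columns are assumptions for illustration; a real dimension table may track versions differently, and an ETL tool would do this work inside the database.

```ruby
require 'date'

# Compares extracted OLTP records with the current dimension rows.
# A changed record expires the old row and appends a new version;
# an unknown record is simply appended.
def apply_scd2(dimension, extracted, tracked:, today: Date.today)
  current = dimension.select { |row| row[:valid_to].nil? }
                     .to_h { |row| [row[:student_id], row] }

  extracted.each do |source|
    existing = current[source[:student_id]]
    # Unchanged in every tracked column: nothing to do.
    next if existing && tracked.all? { |col| existing[col] == source[col] }

    existing[:valid_to] = today if existing # expire the old version
    dimension << source.slice(:student_id, *tracked)
                       .merge(valid_from: today, valid_to: nil)
  end
  dimension
end

start = Date.new(2009, 1, 1)
dim = [
  { student_id: 1, email: 'a@example.com', city: 'Leeds', valid_from: start, valid_to: nil },
  { student_id: 2, email: 'b@example.com', city: 'York',  valid_from: start, valid_to: nil }
]
oltp = [
  { student_id: 1, email: 'a@example.com', city: 'Leeds' }, # unchanged
  { student_id: 2, email: 'b@example.com', city: 'Hull' },  # changed
  { student_id: 3, email: 'c@example.com', city: 'Bath' }   # new
]

apply_scd2(dim, oltp, tracked: %i[email city], today: Date.new(2010, 6, 1))
puts "#{dim.size} rows, #{dim.count { |r| r[:valid_to].nil? }} current"
# prints "4 rows, 3 current"
```

Student 2 ends up with two rows, the expired York version and the current Hull version, which is the history-preserving behaviour Type 2 is meant to give.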
Does your OLTP database have an audit trail?
If so, then you can query the audit trail for just the records that have been touched since the last ETL.
