Importing huge excel file to Rails application - ruby-on-rails

I have an Excel file with thousands of rows. In my case, I can't use bulk insert, because for each row I need to create a few associations. Right now the whole process takes more than an hour for 20k rows, which is hell. What is the best way to resolve this problem?
I'm using the spreadsheet gem.

This is analogous to the infamous "1+N" query situation that Rails loves to encounter. I have a similar situation (importing files of 20k+ rows with multiple associations). The way I optimized this process was to pre-load hashes for the associations. So for example, if you have an AssociatedModel that contains a lookup_column that is in your import data, you would first build a hash:
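associated_model_hash = Hash.new(:not_found)
# find_each loads the records in batches; a bare AssociatedModel.each is not defined on the model class
AssociatedModel.find_each do |item|
  associated_model_hash[item.lookup_column] = item
end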
This provides a hash of objects. You can repeat for as many associations as you have. In your import loop:
associated_model = associated_model_hash[row[:lookup_column]]
new_item.associated_model_id = associated_model.id
Because you don't have to do a search on the database each time, this is much faster. It should also allow you to use bulk insert (assuming you can guarantee that the associated models will not be deleted or modified in a bad way during the load).
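As a minimal sketch of the lookup-hash idea, here is a version that runs without Rails. The Struct stands in for the ActiveRecord model, and build_lookup, rows_to_attributes and the :name column are hypothetical names used only for illustration. It also sets unmatched rows aside instead of calling .id on a default value:

```ruby
# Stand-in for an ActiveRecord model so the sketch runs without Rails.
AssociatedModel = Struct.new(:id, :lookup_column)

# Build the lookup hash once, before the import loop.
def build_lookup(records)
  records.each_with_object({}) { |record, hash| hash[record.lookup_column] = record }
end

# Turn import rows into attribute hashes; rows with no match are set aside.
def rows_to_attributes(rows, lookup)
  matched = []
  unmatched = []
  rows.each do |row|
    associated = lookup[row[:lookup_column]]
    if associated
      matched << { name: row[:name], associated_model_id: associated.id }
    else
      unmatched << row
    end
  end
  [matched, unmatched]
end

lookup = build_lookup([AssociatedModel.new(1, "A"), AssociatedModel.new(2, "B")])
rows = [{ name: "x", lookup_column: "A" }, { name: "y", lookup_column: "Z" }]
matched, unmatched = rows_to_attributes(rows, lookup)
```

In a Rails 6+ application the matched array could probably be handed to insert_all in one statement, which is where the bulk insert mentioned above would come in.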

Related

Add relationships to existing data in Neo4j

To start with Neo4j (4.2.3) I loaded a year's worth of flights data (7m rows) and wanted to try and model a flight as a relationship between origin and destination airport. However the following query just eats up memory and has not finished after two days, so something is clearly amiss:
MATCH (f:Flight), (dest:Airport), (orig:Airport)
WHERE f.Dest = dest.IATA_Code AND f.Origin = orig.IATA_Code
CREATE (orig)-[r:FlightTo {DeptDateTime:f.DepDT, ArriveDateTime:f.ArrDT, Flight:f.Name}]->(dest)
I can do this instead:
LOAD CSV WITH HEADERS FROM 'file:///flights.csv' AS row
MERGE (o:Org_Airport {Org_IATA:row.Origin})
MERGE (d:Dest_Airport {Dest_IATA:row.Dest})
CREATE (o)-[r:FlightTo {DeptDateTime:row.DepDT, ArriveDateTime:row.ArrDT, Flight:row.Name}]->(d)
While this has the advantage of working (even in a reasonable time) it feels ugly to essentially duplicate the airports and also to go through the CSV file again when all the required data is already in the database.
I'm not quite there with my graph thinking probably so I'd appreciate some guidance on what the best way is to add a relationship like this, keeping in mind that original load files might get lost.
Do you have indexes set? Looking at your first query, you'd need:
CREATE INDEX ON :Flight(Dest);
CREATE INDEX ON :Airport(IATA_Code);
If you don't have indexes/constraints set on the label/property, the look up/merge will be very slow.

Import CSV data faster in rails

I am building an import module to import a large set of orders from a csv file. I have a model called Order where the data needs to be stored.
A simplified version of the Order model is below
sku
quantity
value
customer_email
order_date
status
When importing the data two things have to happen
Any dates or currencies need to be cleaned up, i.e. dates are represented as strings in the CSV and need to be converted into a Date object, and currencies need to be converted to a decimal by removing any commas or dollar signs
If a row already exists it has to be updated, the uniqueness is checked based on two columns.
Currently I use a simple csv import code
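CSV.foreach("orders.csv") do |row|
  # The lookup has to be scoped with where; a bare Order.first_or_initialize(...)
  # returns the first row of the table whenever the table is not empty.
  order = Order.where(sku: row[0], customer_email: row[3]).first_or_initialize
  order.quantity = row[1]
  order.value = parse_currency(row[2])
  order.order_date = parse_date(row[4])
  order.status = row[5]
  order.save!
end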
Where parse_currency and parse_date are two functions used to extract the values from strings. In the case of the date it is just a wrapper for Date.strptime.
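For reference, the two helpers might look roughly like this. It is only a sketch: the question does not show them, and the "%m/%d/%Y" date format is an assumption, so adjust it to whatever the CSV really contains.

```ruby
require "bigdecimal"
require "date"

# Strip dollar signs and thousands separators, then convert to a decimal.
def parse_currency(str)
  BigDecimal(str.delete("$,").strip)
end

# Thin wrapper around Date.strptime; the default format is an assumption.
def parse_date(str, format = "%m/%d/%Y")
  Date.strptime(str.strip, format)
end

amount = parse_currency("$1,234.50")
day = parse_date("03/15/2014")
```

BigDecimal raises ArgumentError on strings it cannot parse, so malformed rows fail loudly instead of being stored as zero.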
I can add a check to see if the record already exists and do nothing in case it already exists and that should save a little time. But I am looking for something that is significantly faster. Currently importing around 100k rows takes around 30 mins with an empty database. It will get slower as the data size increases.
So I am basically looking for a faster way to import the data.
Any help would be appreciated.
Edit
After some more testing based on the comments here I have an observation and a question. I am not sure if they should go here or if I need to open a new thread for the questions. So please let me know if I have to move this to a separate question.
I ran a test using Postgres copy to import the data from the file and it took less than a minute. I just imported the data into a new table without any validations. So the import can be much faster.
The Rails overhead seems to be coming from 2 places
The multiple database calls that are happening i.e. the first_or_initialize for each row. This ends up becoming multiple SQL calls because it has to first find the record and then update it and then save it.
Bandwidth. Each time the SQL server is called the data flows back and forth which adds up to a lot of time
Now for my question. How do I move the update/create logic to the database i.e. If an order already exists based on the sku and customer_email it needs to update the record else a new record needs to be created. Currently with rails I am using the first_or_initialize method to get the record in case it exists and update it, else I am creating a new one and saving it. How do I do that in SQL.
I could run a raw SQL query using ActiveRecord connection execute but I do not think that would be a very elegant way of doing it. Is there a better way of doing that?
Since Ruby 1.9, FasterCSV has been part of Ruby's standard library. You don't need a special gem; simply use CSV.
At 100k records in 30 minutes, your import takes about 0.018 secs per record. In my opinion most of that time is spent in Order.first_or_initialize, which costs an extra round trip to your database. Instantiating an ActiveRecord object takes time too. But to really be sure, I would suggest that you benchmark your code.
require 'benchmark'

Benchmark.bm do |x|
  x.report("CSV eval") { CSV.foreach("orders.csv") {} }
  # use a random query to prevent query caching
  x.report("Init: ") { 1.upto(100_000) { Order.where(sku: rand(...), customer_email: rand(...)).first_or_initialize } }
  x.report('parse_currency') { 1.upto(100_000) { parse_currency(...) } }
  x.report('parse_date') { 1.upto(100_000) { parse_date(...) } }
end
You should also watch memory consumption during your import. Maybe the garbage collection does not run often enough or objects are not cleaned up.
To gain speed you can follow Matt Brictson's hint and bypass ActiveRecord.
You can try the gem activerecord-import or you can start to go parallel, for instance multiprocessing with fork or multithreading with Thread.new.
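Whichever bulk route you pick, the rows usually have to be grouped into batches first. Below is a minimal sketch of that step in plain Ruby; build_upsert_batches is a hypothetical helper, the column positions follow the CSV layout in the question, and the currency and date parsing is left out to keep it short. It keeps only the last row per (sku, customer_email) pair, because PostgreSQL refuses to update the same row twice within one INSERT ... ON CONFLICT statement.

```ruby
# Column positions follow the CSV layout in the question:
# sku, quantity, value, customer_email, order_date, status.
def build_upsert_batches(rows, batch_size: 1000)
  latest = {}
  rows.each do |row|
    key = [row[0], row[3]] # uniqueness is (sku, customer_email)
    latest[key] = {
      sku: row[0], quantity: row[1].to_i, value: row[2],
      customer_email: row[3], order_date: row[4], status: row[5]
    }
  end
  # Later duplicates overwrite earlier ones; slice the survivors into batches.
  latest.values.each_slice(batch_size).to_a
end

rows = [
  ["A", "1", "$1", "a@x.com", "01/01/2014", "new"],
  ["A", "2", "$2", "a@x.com", "01/02/2014", "paid"],
  ["B", "3", "$3", "b@x.com", "01/03/2014", "new"]
]
batches = build_upsert_batches(rows, batch_size: 1)
```

Assuming Rails 6+ and a unique index on those two columns, each batch could then be passed to something like Order.upsert_all(batch, unique_by: [:sku, :customer_email]). Keep in mind that this skips validations and callbacks.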

Strategies to speed up access to databases when working with columns containing massive amounts of data (spatial columns, etc)

First things first, I am an amateur, self-taught ruby programmer who came of age as a novice engineer in the age of super-fast computers where program efficiency was not an issue in the early stages of my primary GIS software development project. This technical debt is starting to tax my project and I want to speed up access to this lumbering GIS database.
It's a PostgreSQL database with the PostGIS extension, controlled from Rails, which immediately creates efficiency issues: database columns get turned into objects whenever I access or manipulate records, and one or many of those columns contain text or spatial data easily in excess of 1 megabyte each.
It's extremely slow now, and it didn't used to be like this.
One strategy: I'm considering building child tables of my large spatial data tables (state, county, census tract, etc) so that when I access the tables I don't have to load the massive spatial columns every time I access the objects. But then doing spatial queries might be difficult on a parent table's children. Not sure exactly how I would do that but I think it's possible.
Maybe I have too many indexes. I have a lot of spatial indexes. Do additional spatial indexes from tables I'm not currently using slow down my queries? How about having too many for one table?
These tables have a massive amount of columns. Maybe I should remove some columns, or create parent tables for the columns with massive serialized hashes?
There are A LOT of tables I don't use anymore. Is there a reason other than tidiness to remove these unused tables? Are they slowing down my queries? Simply doing a #count method on some of these tables takes TIME.
PS:
- Looking back at this 8 hours later, I think what I'm equally trying to understand is how many of the above techniques are completely USELESS when it comes to optimizing (rails) database performance?
You don't have to read all of the columns of the table. Just read the ones you need.
You can:
MyObject.select(:id, :col1, :col2).where(...)
... and the omitted columns are not read.
If you try to use a method that needs one of the columns you've omitted then you'll get an ActiveModel::MissingAttributeError (Rails 4), but you presumably know when you're going to need them or not.
The inclusion of large data sets in the table is going to be a noticeable problem from the database side if you have full table scans, and then you might consider moving these data to other tables.
If you only use Rails to read and write the large data columns, and don't use PostgreSQL functions on them, you might be able to compress the data on write and decompress on read. Override the getter and setter methods by using write_attribute and read_attribute, compressing and decompressing (respectively of course) the data.
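A minimal sketch of the compress-on-write idea, using Zlib from the standard library. The plain class below stands in for a model, and the payload attribute is a hypothetical name; in a Rails model the same two methods would wrap write_attribute and read_attribute, and the column would presumably need to be a binary type.

```ruby
require "zlib"

# Stand-in for a model with one large text attribute.
class CompressedRecord
  def initialize
    @raw_payload = nil
  end

  # Setter: compress before storing.
  def payload=(text)
    @raw_payload = text.nil? ? nil : Zlib::Deflate.deflate(text)
  end

  # Getter: decompress on read. Inflate returns a binary string,
  # so the encoding is restored here (UTF-8 is an assumption).
  def payload
    return nil if @raw_payload.nil?
    Zlib::Inflate.inflate(@raw_payload).force_encoding(Encoding::UTF_8)
  end

  # Size of what would actually be written to the database.
  def stored_bytes
    @raw_payload ? @raw_payload.bytesize : 0
  end
end

record = CompressedRecord.new
record.payload = "polygon data " * 1_000
```

Repetitive text compresses well, so the stored size here should be far below the original 13,000 bytes. How much real spatial data shrinks depends on its format.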
Indexing. If you are using Postgres to store such large chunks of data in single fields, consider storing them as Array, JSON or Hstore fields. If you index them using the GIN index type, you can search effectively within a given field.

Exporting and/or displaying multiple records in Rails

I have been working in Rails (I mean serious working) for last 1.5 years now. Coming from .Net background and database/OLAP development, there are many things I like about Rails but there are few things about it that just don't make sense to me. I just need some clarification for one such issue.
I have been working on an educational institute's admission process, which is just a small part of much bigger application. Now, for administrator, we needed to display list of all applied/enrolled students (which may range from 1000 to 10,000), and also give a way to export them as excel file. For now, we are just focusing on exporting in CSV format.
My questions are:
Is Rails meant to display so many records at the same time?
Is will_paginate only way to paginate records in Rails? From what I understand, it still fetches all the records from DB, and then selectively displays relevant records. Back in .Net/PHP/JSP, we used to create stored procedure and from there we selectively returns relevant records. Since, using stored procedure being a known issue in Rails, what other options do we have?
Same issue with exporting this data. I benchmarked the process i.e. receiving request at the server, execution of the query and response return. The ActiveRecord creation was taking a helluva time. Why was that? There were only like 1000 records, and the page showed connection timeout at the user. I mean, if connection times-out while working on for 1000 records, then why use Rails or it means Rails are not meant for such applications. I have previously worked with TB's of data, and never had this issue.
I never understood ORM techniques at the core. Say, we have a table users, and are associated with multiple other tables, but for displaying records, we need data from only tables users and its associated table admissions, then does it actually create objects for all its associated tables. I know, the data will be fetched only if we use the association, but does it create all the objects before-hand?
I hope, these questions are not independent and do qualify as per the guidelines of SF.
Thank you.
EDIT: Any help? I re-checked and benchmarked again for 1000 records, where we are joining 4-5 different tables (1000 users, 2-3 one-to-one associations, and 2-3 one-to-many associations), and it is creating more than 15000 objects. This is for eager loading. As for lazy loading, it will be the 1000-user query plus some 20+ queries. What are other possible options for such problems and applications? I know, I am kinda bumping the question to come to top again!
Rails can handle databases with TBs of data.
Is will_paginate only way to paginate records in Rails?
There are many other gems like "kaminari".
it fetches all records from the db..
NO. It doesn't work that way. For example, take the following query: User.all.page(1).per(10)
User.all won't fire a db query; it returns a proxy object. You then call page(1) and per(10) on the proxy (ActiveRecord::Relation). Only when you try to access the data through the proxy object does it execute a db query. Active Record accumulates all the conditions and parameters you pass and executes a SQL query when required.
Go to the rails console and type u = User.all; "f" (the second statement, "f", is there to prevent the rails console from calling inspect on the proxy to display the result).
It won't fire any query. Now try u[0]; it will fire a query.
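To make the pagination part concrete: page and per only end up as LIMIT and OFFSET on the single query that is eventually run. A rough sketch of that arithmetic in plain Ruby (paginated_sql is a hypothetical helper, and the SQL Rails really generates is quoted differently):

```ruby
# Translate a page number and page size into LIMIT/OFFSET.
def paginated_sql(table, page:, per:)
  offset = (page - 1) * per
  "SELECT * FROM #{table} LIMIT #{per} OFFSET #{offset}"
end

first_page = paginated_sql("users", page: 1, per: 10)
third_page = paginated_sql("users", page: 3, per: 10)
```

So only the rows for the requested page leave the database, not the whole table.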
ActiveRecord creation was taking a helluva time
1000 records shouldn't take much time.
Check the number of SQL queries fired at the db. Look for signs of the n+1 problem and fix them by eager loading.
Check the serialization of the records to csv format for any cpu or memory intensive operation.
Use a profiler and track down the function that is consuming most of the time.

Handling lots of report / financial data in rails 3, without slowing down?

I'm trying to figure out how to ask this - so I'll update the question as it goes to clear things up if needed.
I have a virtual stock exchange game site I've been building for fun. People make tons of trades, and each trade is its own record in a table.
When showing the portfolio page, I have to calculate everything on the fly, on the table of data - i.e. How many shares the user has, total gains, losses etc.
Things have really started slowing down, when I try to segment it by trades by company by day.
I don't really have any code to show to demonstrate this - but it just feels like I'm not approaching this correctly.
UPDATE: This code in particular is very slow
# Returning an array of values for a total portfolio over time
def portfolio_value_over_time
  portfolio_value_array = []
  days = self.from_the_first_funding_date
  companies = self.list_of_companies
  days.each_with_index do |day, index|
    # Starting value
    days_value = 0
    companies.each do |company|
      holdings = self.holdings_by_day_and_company(day, company)
      price = Company.find_by_id(company).day_price(day)
      days_value = days_value + (holdings * price).to_i
    end
    # Adding all companies together for that day
    portfolio_value_array[index] = days_value
  end
  portfolio_value_array
end
The page load time can be up to 20+ seconds - totally insane. And I've cached a lot of the requests in Memcache.
Should I not be generating reports / charts on the live data like this? Should I be running a cron task and storing them somewhere? What's the best practice for handling this volume of data in Rails?
The Problem
Of course it's slow. You're presumably looking up large volumes of data from each table, and performing multiple lookups on multiple tables on every iteration through your loop.
One Solution (Among Many)
You need to normalize your data, create a few new models to store expensive calculated values, and push more of the calculations onto the database or into tables.
The fact that you're doing a nested loop over high-volume data is a red flag. You're making many calls to the database, when ideally you should be making as few sequential requests as possible.
I have no idea how you need to normalize your data or optimize your queries, but you can start by looking at the output of explain. In general, though, you probably will want to eliminate any full table scans and return data in larger chunks, rather than a record at a time.
This really seems more like a modeling or query problem than a Rails problem, per se. Hope this gets you pointed in the right direction!
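One way to cut the round trips, as a sketch only: load the holdings and prices for the whole date range up front (ideally one query each), key them in hashes, and run the nested loop purely in memory. The hash shapes below are assumptions about how that data could be keyed, not the asker's actual schema.

```ruby
# holdings: { [day, company_id] => number of shares }
# prices:   { [company_id, day] => price per share }
# Both hashes are assumed to be preloaded before this method is called.
def portfolio_value_over_time(days, company_ids, holdings, prices)
  days.map do |day|
    company_ids.sum do |company_id|
      shares = holdings.fetch([day, company_id], 0)
      price = prices.fetch([company_id, day], 0)
      (shares * price).to_i
    end
  end
end

days = [1, 2]
company_ids = [10, 20]
holdings = { [1, 10] => 2, [2, 10] => 2, [2, 20] => 1 }
prices = { [10, 1] => 5.5, [10, 2] => 6.0, [20, 2] => 3.0 }
values = portfolio_value_over_time(days, company_ids, holdings, prices)
```

The loop still runs days times companies iterations, but each one is a hash lookup rather than a database call.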
You should precompute and store all this data on another table. An example table might look like this:
Table: PortfolioValues
Column: user_id
Column: day
Column: company_id
Column: value
Index: user_id
Then you can easily load all the user's portfolio data with a single query, for example:
current_user.portfolio_values
Since you're using memcached anyway, use it to cache some of those queries. For example:
Company.find_by_id(company).day_price(day)

Resources