CSV Import to multiple tables - speed consideration - ruby-on-rails

I have an app that will take sales made available to vendors at Whole Foods and process the daily sales data by store and item. All the parent information is stored in one downloaded CSV with about 10,000 lines per month.
The importing process checks for new stores before importing the sale information.
I don't know how to measure the time processes take in Ruby and Rails, but I was wondering whether it would be faster to process each line against both tables as I go, or to process the file once for one table (stores) and then again for the other table (sales).
If it matters at all: new stores are not added often, though stores might be closed (and the import checks for that as well), so the scan through the stores might only add a few new entries, whereas every row of the CSV is added to the sales.
If this isn't appropriate - I apologize - still working out the kinks of the rules

When it comes to processing data with Ruby the memory consumption is what you should be concerned about.
With CSV processing in Ruby, the best you can do is read line by line:
require "csv"

file = CSV.open("data.csv")
while (line = file.readline)
  # process a single row here
end
file.close
This way, no matter how many lines are in the file, only a single one (plus the previously processed one) is loaded into memory at a time; the GC collects processed lines as your program runs. This approach consumes almost no memory, and it speeds up the parsing process, too.
i was wondering if it would be 'faster' to process one line at a time
to each table or to process the file for one table (stores) and then
to the other table (sales)
I would go with one line at a time to each table.
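The one-pass approach can be sketched in plain Ruby; the column layout (store_id, item, amount) is an assumption, and the Set/array stand in for what would be `find_or_create_by` calls and inserts in a Rails app:

```ruby
require "csv"
require "set"

# Single-pass import sketch: stores are tracked in a Set so each new
# store is handled exactly once, while every row becomes a sale.
def import(csv_text)
  known_stores = Set.new
  sales = []
  CSV.parse(csv_text, headers: true).each do |row|
    known_stores.add(row["store_id"]) # no-op if the store is already known
    sales << { store_id: row["store_id"], item: row["item"], amount: row["amount"].to_f }
  end
  [known_stores, sales]
end
```

The point is that checking for a new store per line is cheap (a set/index lookup), so a second pass over the file buys you nothing.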

Related

Parsing large CSVs while searching Database

We currently have a tricky problem and need ideas for the most efficient way to solve it.
We periodically iterate through large CSV files (~50000 to 2m rows), and for each row, we need to check a database table for matching columns.
So for example, each CSV row could have details about an Event - artist, venue, date/time etc, and for each row, we check our database (PG) for any rows that match the artist, venue and date/time the most, and then perform operations if any match is found.
Currently, the entire process is highly CPU-, memory- and time-intensive when pulling row by row, so we perform the matching in batches, but we are still seeking an efficient way to perform the comparison, both memory-wise and time-wise.
Thanks.
Load the complete CSV file into a temporary table in your database (using a DB tool, see e.g. How to import CSV file data into a PostgreSQL table?)
Perform matching and operations in-database, i.e. in SQL
If necessary, truncate the temporary table afterwards
This would move most of the load into the DB server, avoiding all the ActiveRecord overhead (network traffic, result parsing, model instantiation etc.)
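The in-database matching step (2) might look like the sketch below; all table and column names are assumptions, since the question doesn't give a schema:

```ruby
# Hypothetical schema: events(id, artist, venue, starts_at) plus a
# temporary table events_staging with the same columns, loaded via COPY.
# The matching then runs entirely inside the database.
MATCH_SQL = <<~SQL
  SELECT e.id, s.artist, s.venue, s.starts_at
  FROM events_staging s
  JOIN events e
    ON e.artist = s.artist
   AND e.venue = s.venue
   AND e.starts_at = s.starts_at
SQL
# In Rails this would run through the raw connection, e.g.
#   ActiveRecord::Base.connection.execute(MATCH_SQL)
```

Only the matched IDs cross the wire, instead of 50k-2m full rows.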

Import CSV data faster in rails

I am building an import module to import a large set of orders from a csv file. I have a model called Order where the data needs to be stored.
A simplified version of the Order model is below
sku
quantity
value
customer_email
order_date
status
When importing the data two things have to happen
Any dates or currencies need to be cleaned up: dates are represented as strings in the CSV and need to be converted into Ruby Date objects, and currencies need to be converted to decimals by removing any commas or dollar signs.
If a row already exists it has to be updated, the uniqueness is checked based on two columns.
Currently I use a simple csv import code
CSV.foreach("orders.csv") do |row|
  order = Order.first_or_initialize(sku: row[0], customer_email: row[3])
  order.quantity = row[1]
  order.value = parse_currency(row[2])
  order.order_date = parse_date(row[4])
  order.status = row[5]
  order.save!
end
Where parse_currency and parse_date are two functions used to extract the values from strings. In the case of the date it is just a wrapper for Date.strptime.
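For concreteness, the two helpers might look like this; the exact input formats are assumptions, so adjust them to the actual CSV:

```ruby
require "date"

# Strip dollar signs and thousands separators, then convert to a number.
def parse_currency(str)
  str.delete("$,").to_f
end

# The CSV is assumed to use MM/DD/YYYY here.
def parse_date(str)
  Date.strptime(str, "%m/%d/%Y")
end
```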
I can add a check to skip rows that already exist unchanged, which should save a little time, but I am looking for something significantly faster. Currently, importing around 100k rows takes around 30 minutes with an empty database, and it will get slower as the data size increases.
So I am basically looking for a faster way to import the data.
Any help would be appreciated.
Edit
After some more testing based on the comments here I have an observation and a question. I am not sure if they should go here or if I need to open a new thread for the questions. So please let me know if I have to move this to a separate question.
I ran a test using Postgres copy to import the data from the file and it took less than a minute. I just imported the data into a new table without any validations. So the import can be much faster.
The Rails overhead seems to be coming from 2 places
The multiple database calls that are happening i.e. the first_or_initialize for each row. This ends up becoming multiple SQL calls because it has to first find the record and then update it and then save it.
Bandwidth. Each time the SQL server is called the data flows back and forth which adds up to a lot of time
Now for my question: how do I move the update/create logic into the database? That is, if an order already exists (based on the sku and customer_email) it needs to be updated; otherwise a new record needs to be created. Currently, with Rails, I use the first_or_initialize method to get the record if it exists and update it; otherwise I create a new one and save it. How do I do that in SQL?
I could run a raw SQL query using the ActiveRecord connection's execute, but I do not think that would be a very elegant way of doing it. Is there a better way?
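If your PostgreSQL version supports it (9.5 or later), the create-or-update step can be expressed as a single INSERT ... ON CONFLICT statement. A sketch, assuming a unique index on (sku, customer_email):

```ruby
# Hypothetical upsert for the orders table; requires
#   CREATE UNIQUE INDEX ON orders (sku, customer_email);
UPSERT_SQL = <<~SQL
  INSERT INTO orders (sku, quantity, value, customer_email, order_date, status)
  VALUES ($1, $2, $3, $4, $5, $6)
  ON CONFLICT (sku, customer_email)
  DO UPDATE SET quantity = EXCLUDED.quantity,
                value = EXCLUDED.value,
                order_date = EXCLUDED.order_date,
                status = EXCLUDED.status
SQL
# Executed per row (or better, per batch) through the ActiveRecord
# connection, e.g. ActiveRecord::Base.connection.exec_query.
```

This collapses the find-then-update round trips into one statement per row.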
Since Ruby 1.9, FasterCSV is part of the Ruby core library. You don't need a special gem; simply use CSV.
With 100k records in 30 minutes, Ruby takes about 0.018 seconds per record. In my opinion, most of your time is spent in Order.first_or_initialize: this part of your code takes an extra round trip to your database, and initializing an ActiveRecord object takes time too. But to really be sure, I would suggest that you benchmark your code:
Benchmark.bm do |x|
  x.report("CSV eval") { CSV.foreach("orders.csv") {} }
  x.report("Init: ") { 1.upto(100_000) { Order.first_or_initialize(sku: rand(...), customer_email: rand(...)) } } # use random values to prevent query caching
  x.report("parse_currency") { 1.upto(100_000) { parse_currency(...) } }
  x.report("parse_date") { 1.upto(100_000) { parse_date(...) } }
end
You should also watch memory consumption during your import. Maybe the garbage collection does not run often enough or objects are not cleaned up.
To gain speed you can follow Matt Brictson's hint and bypass ActiveRecord.
You can try the gem activerecord-import or you can start to go parallel, for instance multiprocessing with fork or multithreading with Thread.new.
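The batching idea behind activerecord-import can be sketched with plain Ruby: collect rows into fixed-size slices so that one multi-row insert (e.g. Order.import, or insert_all in newer Rails) replaces thousands of single-row saves. The column names here are assumptions:

```ruby
require "csv"

# Slice the parsed CSV into batches; in a Rails app each batch would be
# passed to one bulk insert instead of 100k individual save! calls.
def each_batch(csv_text, batch_size)
  batches = []
  CSV.parse(csv_text, headers: true).each_slice(batch_size) do |rows|
    batches << rows.map(&:to_h) # in Rails: Order.import or Order.insert_all
  end
  batches
end
```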

What's the best pattern for logging data on a Stateful Object?

Currently I'm thinking about adding a JSON array column (I'm using Postgres) and just pumping log messages for the object into that attribute. I want to log progress: the object is an import report that does a lot of work and takes a while, so it's useful to have a sense of what's currently happening (how many rows have been imported, how many rows have been normalized, etc.).
The other option is to add one of the gems that allow you to see logs streamed in a view, but this I think isn't as useful since what I'm looking for is something where I can see the history of this specific object.
Using a json column or json[] (PostgreSQL array of json) is a very bad idea for logging.
Each time you update it, the whole column contents must be read, modified in memory, and written out again in their entirety.
Instead, create a table used for logs for objects of this kind, with a FK to the table being logged and a timestamp for each entry. Insert a row for each log entry.
BTW, if the report runs in a single transaction, other clients won't be able to see any of the log rows until the whole transaction commits, in which case it won't be good for progress monitoring; but neither will your original idea. You'd need to use NOTICE messages instead.

How should I auto-expire entries in an ETS table, while also limiting its total size?

I have a lot of analytics data which I'm looking to aggregate every so often (let's say one minute.) The data is being sent to a process which stores it in an ETS table, and every so often a timer sends it a message to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)
If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the index is the timestamp. Though I'm not sure if this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-increment field with each insert; when the table is being trimmed, look at the first and the last index, calculate the difference, and again use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
You can determine the amount of memory occupied using ets:info(Tab, memory). The result is in number of words, but there is a catch: if you are storing binaries, only heap binaries are included. So if you are storing mostly normal Erlang terms you can use it, and with a timestamp as you described, it is a way to go. For the size in bytes, just multiply by erlang:system_info(wordsize).
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
I would do the following: create a server responsible for
receiving all the data storage messages. These messages should be timestamped by the client process (so it doesn't matter if a message waits a little in the server's queue). The server then stores them in the ETS table, configured as an ordered_set and using the timestamp, converted to an integer, as the key (if the timestamps come from erlang:now in one single VM they will be distinct; if you are using several nodes, you will need to add some information such as the node name to guarantee uniqueness).
receiving a tick (using for example timer:send_interval), then processing the messages received in the last N µsec (using Key = current time - N and ets:next(Table, Key)), continuing to the last message. Finally, you can discard all the messages via ets:delete_all_objects(Table). If you had to add information such as a node name, it is still possible to use the next function (for example, if the keys are {TimeStamp:int(), Node:atom()} you can compare to {Time:int(), 0}, since a number is smaller than any atom).
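The two trimming rules from the question (fixed-size queue plus timeout expiry) can be sketched in Ruby, standing in for an ordered_set ETS table keyed by timestamp; the class and its limits are illustrative assumptions:

```ruby
# A buffer that drops the oldest entry when full (rule 1) and can
# discard entries older than a cutoff (rule 2). Entries are kept in
# insertion order, which equals timestamp order here.
class ExpiringBuffer
  def initialize(max_size)
    @max_size = max_size
    @entries = []
  end

  def push(timestamp, value)
    @entries.shift if @entries.size >= @max_size # rule 1: fixed-size queue
    @entries << [timestamp, value]
  end

  def trim_older_than(cutoff)
    @entries.reject! { |ts, _| ts < cutoff }     # rule 2: timeout expiry
  end

  def size
    @entries.size
  end
end
```

With both rules applied on each tick, the table stays bounded in size and everything in it is newer than the cutoff.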

How to create staging table to handle incremental load

We are designing a Staging layer to handle incremental load. I want to start with a simple scenario to design the staging.
In the source database there are two tables, e.g. tbl_Department and tbl_Employee. Both tables load into a single table at the destination database, e.g. tbl_EmployeRecord.
The query which is loading tbl_EmployeRecord is,
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
Now, we need to identify incremental load in tbl_Department, tbl_Employee and store it in staging and load only the incremental load to the destination.
The columns of the tables are,
tbl_Department : DEPARTMENTID,DEPTNAME
tbl_Employee : EMPID,EMPNAME,DEPARTMENTID
tbl_EmployeRecord : EMPID,EMPNAME,DEPTNAME
Kindly suggest how to design the staging for this to handle Insert, Update and Delete.
Identifying Incremental Data
The incremental load needs to be based on some segregating information present in your source table. Such information helps you identify the incremental portion of the data that you will load. Often, the load date or last-updated date of the record is a good choice for this.
Consider this, your source table has a date column that stores both the date of insertion of the records as well as the date when any update was done on that record. At any given day during your staging load, you may take advantage of this date to identify which are the records that are newly inserted or updated since your last staging load and you consider only those changed / updated records as your incremental delta.
Given your table structures, I am not sure which column you could use for this. ID columns will not help, because you won't know when an existing record gets updated.
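Delta extraction by last-updated date amounts to a simple filter; in Ruby terms (the record layout here is hypothetical):

```ruby
require "date"

# Given the date of the last successful load, keep only rows that were
# inserted or updated since then; those rows are the incremental delta.
def incremental_delta(records, last_load_date)
  records.select { |r| r[:last_updated] >= last_load_date }
end
```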
Maintaining Load History
It is important to store information about how much you have loaded today so that you can load the next portion in the next load. To do this, maintain a staging table, often called a Batch Load Details table. That table typically has a structure such as below:
BATCH ID | START DATE | END DATE | LOAD DATE | STATUS
------------------------------------------------------
1 | 01-Jan-14 | 02-Jan-14 | 02-Jan-14 | Success
You need to insert a new record into this table every day before you start the data load. The new record will have a start date equal to the end date of the last successful load and a null status. Once loading succeeds, you update the status to 'Success'.
Modification in data Extraction Query to take Advantage of Batch Load Table
Once you maintain your loading history like above, you may include this table in your extraction query,
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
WHERE E.load_date >= (SELECT max(START_DATE) FROM BATCH_LOAD WHERE status IS NULL)
What I am going to suggest is by no means a standard. In fact, you should evaluate my suggestion carefully against your requirements.
Suggestion
Use incremental loading for transaction data, not for master data. Transaction data are generally higher in volume and can easily be segregated into incremental chunks. Master data tend to be more manageable and can be loaded in full every time. In the example above, I am assuming your Employee table behaves like transactional data, whereas your Department table is your master data.
I trust this article on incremental loading will be very helpful for you
I'm not sure what database you are using, so I'll just talk in conceptual terms. If you want to add tags for specific technologies, we can probably provide specific advice.
It looks like you have 1 row per employee and that you are only keeping the current record for each employee. I'm going to assume that EMPIDs are unique.
First, add a field to the query that currently populates the dimension. This field will be a hash of the other fields in the table (EMPID, EMPNAME, DEPTNAME). You can create a view, populate a new staging table, or just use the query. Also add this same hash field to the dimension table. Basically, the hash is an easy way to generate a field that is unique for each record and efficient to compare.
Inserts: These are the records for which the EMPID does not already exist in the dimension table but does exist in your staging query/view.
Updates: These are the records for which the EMPID exists in both the staging query/view and the dimension table, but the hash fields don't match.
Deletes: These are the records for which the EMPID exists in the dimension but does not exist in the staging query/view.
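The hash-and-compare classification above can be sketched in Ruby; the record layout mirrors the columns in the question, and SHA-1 stands in for whatever hash your database provides:

```ruby
require "digest"

# Digest of a record's attribute values: identical records hash
# identically, so comparing hashes detects changed rows cheaply.
def row_hash(rec)
  Digest::SHA1.hexdigest([rec[:empid], rec[:empname], rec[:deptname]].join("|"))
end

# Compare the staging rows to the dimension rows by EMPID and hash,
# yielding the three sets described above.
def classify(staging, dimension)
  stg = staging.map { |r| [r[:empid], row_hash(r)] }.to_h
  dim = dimension.map { |r| [r[:empid], row_hash(r)] }.to_h
  {
    inserts: stg.keys - dim.keys,
    updates: (stg.keys & dim.keys).select { |id| stg[id] != dim[id] },
    deletes: dim.keys - stg.keys,
  }
end
```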
If this will be high-volume, you may want to create new tables to hold the records that should be inserted and the records that should be updated. Once you have identified the records, you can insert/update them all at once instead of one-by-one.
It's a bit uncommon to delete lots of records from a data warehouse, as warehouses are typically used to keep history. I would suggest instead creating a status column or bit field that indicates whether the record is active or deleted in the source. Of course, how you handle deletes should depend on your business needs/reporting requirements. Just remember that if you do a hard delete, you can never get that data back if you decide you need it later.
Updating the existing dimension in place (rather than creating historical records for each change) is called a Type 1 dimension in dimensional modeling terms. This is fairly common. But if you decide you need to keep history, you can use the hash to help you create SCD Type 2 records.