I have a large XLS file with postal codes, and the problem is that it is quite slow to read. The file has one sheet per state, named after the state, and each sheet contains rows with postal code, neighborhood, and municipality. There are 33 states, and each state has between 1,000 and 9,000 rows.
I parse this into an array of hashes, which takes 22 seconds. Is there any way to read it faster?
This is how I read a sheet:
def read_sheet(sheet_name:, offset: 1)
  # Sheet names are truncated to 31 characters, the XLS sheet-name limit
  sheet = file.worksheet sheet_name[0..30]
  # Skip the header row and drop completely empty rows
  clean_data = sheet.each_with_index(offset)
                    .lazy
                    .reject { |row, _| !row.any? }
  # Turn each row into a hash according to DATA_MAPPING (field => column index)
  data = clean_data.map do |row, _index|
    DATA_MAPPING.map do |field, column|
      { field => row[column] }
    end.inject(:merge)
  end
  yield data
end
And I retrieve all the sheets with:
def read_file
  result = {}
  sheets_titles.each do |name|
    read_sheet(sheet_name: name) do |data|
      result.merge! name => data.to_a
    end
  end
  result
end
So whenever I call .to_a, .to_json, or any other method that actually processes the data and inserts it into the DB, I have to wait several seconds ... any suggestions?
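One small change worth trying (a sketch under the same DATA_MAPPING assumption, not a measured fix) is to build each row's hash in a single pass with each_with_object instead of creating and merging one single-key hash per field, which allocates many short-lived objects per row:
  # Sketch: same output shape, fewer intermediate allocations per row.
  # Assumes DATA_MAPPING maps field names to column indexes, as in the original.
  data = clean_data.map do |row, _index|
    DATA_MAPPING.each_with_object({}) do |(field, column), hash|
      hash[field] = row[column]
    end
  end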
I have a table that has a few thousand sets of 2-3 nearly identical records that all share a unique "id" (not the database ID, but an item id). That is, two to three records share the same item id; there are about 2,100 records, or roughly 700 unique items. Example:
{id: 1, product_id:333, is_special:true}, {id:2, product_id:333, is_special:false}, {id:3, product_id:333, is_special:false}, {id:4, product_id:334, is_special:false}...
I'd like to perform a query that lets me iterate over each set, modify/remove duplicate records, then move on to the next set.
This is what I currently have:
task find_averages: :environment do
  # chunk_while only groups adjacent records, so this assumes the responses
  # come back ordered by result_id
  responses = Response.all
  chunked_responses = responses.chunk_while { |a, b| a.result_id == b.result_id }.to_a
  chunked_responses.each do |chunk|
    if chunk.length < 3
      chunk.each do |chunky_response|
        chunky_response.flagged = true
        chunky_response.save
      end
    else
      chunk.each do |chunky_response|
        # manipulate each item in the chunk here
      end
    end
  end
end
Edit: I worked this one out after discovering the chunk_while method. I am not positive chunk_while is the most efficient approach here, but it works well. I am closing this, but for anyone else who needs to group records and then iterate over them, this should help.
The following code iterates over an array of items, some of which share common values, and groups them by those values:
responses = Response.all
chunked_responses = responses.chunk_while { |a, b| a.result_id == b.result_id }.to_a
chunked_responses.each do |chunk|
  chunk.each do |chunky_response|
    # manipulate each item in the chunk here
  end
end
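Since chunk_while only merges adjacent records, an alternative that does not depend on the ordering of the result set (a sketch, assuming result_id is the shared item id as above) is group_by:
  # Groups all records with the same result_id regardless of ordering
  Response.all.group_by(&:result_id).each do |result_id, chunk|
    chunk.each do |chunky_response|
      # manipulate each item in the chunk here
    end
  end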
The problem
I'm trying to parse a huge CSV file (27 MB) and delete a large number of rows, but I'm running into performance issues.
Specifications
Rails version 4.2.0, Postgres as the database
videos table has 300,000 rows
categories_videos pivot table has 885,000 rows
Loading the external CSV file takes 29,097 ms
The external CSV file has 3,117,000 lines (one deleted video ID per line)
The task
I have a large (27 MB) CSV file with the IDs of videos that were deleted. I have to go through this file, check whether any videos in my database have a matching ID, and if they do, delete them from my database.
1) roughly 126,724 ms (per chunk)
file_location = 'http://my_external_source/file.csv'
open(file_location, 'r:utf-8') do |f|
  data = SmarterCSV.process(f, { :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000 }) do |chunk|
    chunk = chunk.map { |row| row[:id] }
    Video.delete_all(:id => chunk)
    VideoCategoryRelation.delete_all(:video_video_id => chunk)
  end
end
2) roughly 90,000 ms (per chunk)
file_location = 'http://my_external_source/file.csv'
open(file_location, 'r:utf-8') do |f|
  data = SmarterCSV.process(f, { :headers_in_file => false, :user_provided_headers => ["id"], :chunk_size => 1000 }) do |chunk|
    chunk = chunk.map { |row| row[:id] }
    Video.where(:video_id => chunk).destroy_all
  end
end
Is there an efficient way to go through this that would not take hours?
I don't know Ruby or the database you are using, but it looks like there are a lot of separate delete calls to the database.
Here's what I would try to speed things up:
First, make sure you have an index on id in both tables.
In each table, create a field (boolean or small int) to mark a record for deletion. In your loop, instead of deleting, just set the deletion field to true (this should be fast if you have an index on id). Only at the end, call delete once on each table (DELETE FROM the table WHERE the deletion marker is true).
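In Rails terms, that two-step approach might look roughly like this (a sketch only; the marked_for_deletion column is an assumption and would need a migration, and the column names follow the second attempt above):
  # Step 1: per chunk, mark matching rows with a single UPDATE instead of deleting them
  file_location = 'http://my_external_source/file.csv'
  open(file_location, 'r:utf-8') do |f|
    SmarterCSV.process(f, headers_in_file: false, user_provided_headers: ["id"], chunk_size: 1000) do |chunk|
      ids = chunk.map { |row| row[:id] }
      Video.where(video_id: ids).update_all(marked_for_deletion: true)
    end
  end

  # Step 2: delete everything that was marked, one statement per table
  VideoCategoryRelation.where(video_video_id: Video.where(marked_for_deletion: true).select(:video_id)).delete_all
  Video.where(marked_for_deletion: true).delete_all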
Context:
Trying to generate an array with one element for each created_at day in a db table. Each element is the average of the points (integer) column from the records created on that day.
This will later be graphed to display the average number of points on each day.
Result:
I've been successful in doing this, but it feels like an unnecessary amount of code to generate the desired result.
Code:
def daily_avg
  # get all data for current user
  records = current_user.rounds
  # make array of long dates
  long_date_array = records.pluck(:created_at)
  # create array to store short dates
  short_date_array = []
  # remove time of day
  long_date_array.each do |date|
    short_date_array << date.strftime('%Y%m%d')
  end
  # remove duplicate dates
  short_date_array.uniq!
  # array of avg by date
  array_of_avg_values = []
  # iterate through each day
  short_date_array.each do |date|
    temp_array = []
    # make array of records with this day
    records.each do |record|
      if date === record.created_at.strftime('%Y%m%d')
        temp_array << record.audio_points
      end
    end
    # calc avg by day and append to array_of_avg_values
    array_of_avg_values << temp_array.inject(0.0) { |sum, el| sum + el } / temp_array.size
  end
  render json: array_of_avg_values
end
Question:
I think this is a common extraction problem needing to be solved by lots of applications, so I'm wondering if there's a known repeatable pattern for solving something like this?
Or a more optimal way to solve this?
(I'm barely a junior developer so any advice you can share would be appreciated!)
Yes, that's a lot of unnecessary stuff when you can just go down to SQL to do it (I'm assuming you have a class called Round in your app):
class Round
  DAILY_AVERAGE_SELECT = "SELECT
                            DATE(rounds.created_at) AS day_date,
                            AVG(rounds.audio_points) AS audio_points
                          FROM rounds
                          WHERE rounds.user_id = ?
                          GROUP BY DATE(rounds.created_at)"

  def self.daily_average(user_id)
    connection.select_all(sanitize_sql_array([DAILY_AVERAGE_SELECT, user_id]), "daily-average")
  end
end
Doing this straight in the database will be faster (and also involve less code) than doing it in Ruby as you are now.
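A hypothetical call site for the method above (select_all returns an ActiveRecord::Result whose rows are hashes keyed by the SELECT aliases):
  # Sketch: render the per-day averages computed by the query above
  render json: Round.daily_average(current_user.id).to_a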
I advise you to do something like this:
grouped =
  records.order(:created_at).group_by do |r|
    r.created_at.strftime('%Y%m%d')
  end
Here you first generate SQL close to what you ultimately want as a first approximation, then you group the returned records by the created_at field converted to just a date.
points =
  grouped.map do |(date, values)|
    [ date, values.map(&:audio_points).reduce(0.0, :+) / values.size ]
  end.to_h
# => { "19700101" => 155.0, ... }
Then you remap the grouped hash through an array to compute the average audio_points value for each date.
You can also use the group and calculation methods built into ActiveRecord: http://guides.rubyonrails.org/active_record_querying.html#group
http://guides.rubyonrails.org/active_record_querying.html#calculations
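With those, the whole thing can collapse into a single grouped calculation (a sketch; the DATE(...) expression assumes a Postgres-style database):
  # Returns a hash keyed by day, e.g. { "2016-01-05" => 155.0, ... }
  current_user.rounds.group("DATE(created_at)").average(:audio_points)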
I would like to create a Capybara method for reading the contents of a table that takes a variable number of parameters and iterates through them.
Here is the method I have:
Then /^I should see a table record with "(.*?)", "(.*?)", "(.*?)"$/ do |name, address, phone|
  rows = page.all(".table-bordered tr")
  expect(rows.any? { |record| record.has_content? name }).to be_true
  rows.each do |record|
    if record.has_content? name
      expect(record.has_content? address).to be_true
      expect(record.has_content? phone).to be_true
    end
  end
end
I'm using the same CSS table structure to create tables with much larger numbers of columns elsewhere in the program. So whether the table has 3 columns or 12, I'd like to be able to use the same method so I don't write awkward code.
How can I assign a variable number of parameters and loop through each parameter in Capybara?
def assert_my_table(name, *row_data)
  # Finding the row by name with a single XPath query is much faster than looping through all rows
  row = page.find(:xpath, "//*[@class='table-bordered']//tr[./td='#{name}']")
  # Retrieve the row contents only once (again, faster than retrieving it for each column you want to assert)
  row_text = row.text
  row_data.each do |text|
    expect(row_text).to include(text)
  end
end
assert_my_table(name, address, phone)
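To drive this from a single step with a variable number of quoted values, one option (a sketch, not part of the original answer) is to capture the whole quoted list and split it inside the step definition:
  # Matches e.g.: Then I should see a table record with "Jones", "12 Main St", "555-1234"
  Then /^I should see a table record with (".*")$/ do |quoted_values|
    name, *rest = quoted_values.scan(/"([^"]*)"/).flatten
    assert_my_table(name, *rest)
  end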
I have an array of 300K strings which represent dates:
date_array = [
"2007-03-25 14:24:29",
"2007-03-25 14:27:00",
...
]
I need to count occurrences of each date in this array (e.g., all date strings for "2011-03-25"). The exact time doesn't matter -- just the date. I know the range of dates within the file. So I have:
Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  count = 0
  date_array.each do |date_string|
    if Date.parse(date_string) >= date_to_count &&
       Date.parse(date_string) <= date_to_count
      count += 1
    end
  end
  puts "#{date_to_count} occurred #{count} times."
end
Counting occurrences of just one date takes longer than 60 seconds on my machine. In what ways can I optimize the performance of this task?
Possibly useful notes: I'm using Ruby 1.9.2. This script is running in a Rake task with rake 0.9.2. The date_array is loaded from a CSV file. On each iteration, the count is saved as a record in my Rails project database.
Yes, you don't need to parse the dates at all if they are formatted the same. Knowing your data is one of the most powerful tools you can have.
If the datetime strings are all in the same format (yyyy-mm-dd HH:MM:SS), then you could do something like:
date_array.group_by { |datetime| datetime[0..9] }
This will give you a hash with the date strings as keys and arrays of the matching datetime strings as values:
{
"2007-05-06" => [...],
"2007-05-07" => [...],
...
}
So you'd then take the length of each array:
date_array.group_by { |datetime| datetime[0..9] }.each do |date_string, datetimes|
  puts "#{date_string} occurred #{datetimes.length} times."
end
Of course, that method wastes memory by building arrays of datetimes you don't actually need.
So how about a more memory-efficient method:
date_counts = {}
date_array.each do |date_string|
  date = date_string[0..9]
  date_counts[date] ||= 0 # initialize count if necessary
  date_counts[date] += 1
end
You'll end up with a hash with the date strings as keys and the counts as values:
{
"2007-05-06" => 123,
"2007-05-07" => 456,
...
}
Putting everything together
date_counts = {}
date_array.each do |date_string|
  date = date_string[0..9]
  date_counts[date] ||= 0 # initialize count if necessary
  date_counts[date] += 1
end

Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  puts "#{date_to_count} occurred #{date_counts[date_to_count.to_s].to_i} times."
end
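A slightly terser variant of the counting loop (same behavior, just using a hash with a default value):
  # Hash.new(0) returns 0 for missing keys, so no explicit initialization is needed
  date_counts = Hash.new(0)
  date_array.each { |date_string| date_counts[date_string[0..9]] += 1 }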
This is a really awful algorithm to use. You're scanning through the entire list for each date, and further, you're parsing the same date twice for no apparent reason. That means for N dates in the range and M dates in the list you're doing N*M*2 date parses.
What you really need is to use group_by and do it in one pass:
dates = date_array.group_by do |date_string|
  Date.parse(date_string)
end
Then you can use this as a reference for your counts:
Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  puts "#{date_to_count} occurred #{dates[date_to_count] ? dates[date_to_count].length : 0} times."
end