Currently, I want to import more than 55,000 records into my database from a CSV file. This is the code that I am using:
CSV.foreach(Rails.root.join('db/seeds/locations.csv'), headers: true) do |row|
val = Location.find_or_initialize_by(code: row[0])
val.name = row[1]
val.ecc = row[2] || 'MISSING'
val.created_by = User.find_by(name: 'anh')
val.updated_by = User.find_by(name: 'anh')
val.save!
end
However, it is too slow, so I have just installed the gem 'postgres-copy'. I read the official documentation, and I believe I can use the class method copy_from to do the job. But as you can see in my current code, each record also references another table (an association), and the documentation doesn't mention anything about associations or validations. Is there any way to handle this? This is the first time I have used this gem. Thanks for reading.
I don't know that gem, but I would be very surprised if it supported multi-table copy, since PostgreSQL's COPY works on a single table. 50K rows isn't all that many. You might try wrapping your insertions in transactions to avoid one commit per row. I doubt you want to wrap all 50K in a single transaction, though, so try something like this:
User.connection.begin_transaction
i = 0
CSV.foreach(...) do |row|
... # your original code here
i += 1
if i % 500 == 0
User.connection.commit_transaction
User.connection.begin_transaction
end
end
User.connection.commit_transaction
This will commit your rows 500 at a time, and you should see a noticeable speed-up. Play around with the value of 500 to find the sweet spot.
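To get a feel for how many commits the batching pattern produces, here is a minimal sketch that runs without a database. FakeConnection and insert_in_batches are hypothetical names made up for illustration; the fake only counts commits and is not part of Rails.

```ruby
# Hypothetical stand-in for a database connection; it only counts commits.
class FakeConnection
  attr_reader :commits

  def initialize
    @commits = 0
    @open = false
  end

  def begin_transaction
    raise 'transaction already open' if @open

    @open = true
  end

  def commit_transaction
    raise 'no open transaction' unless @open

    @open = false
    @commits += 1
  end
end

# Yields every row to the block, committing once per batch_size rows.
def insert_in_batches(connection, rows, batch_size)
  rows.each_slice(batch_size) do |batch|
    connection.begin_transaction
    batch.each { |row| yield row }
    connection.commit_transaction
  end
end

conn = FakeConnection.new
inserted = []
insert_in_batches(conn, (1..1234).to_a, 500) { |row| inserted << row }
puts "rows: #{inserted.size}, commits: #{conn.commits}"
```

With 1,234 rows and a batch size of 500, that is three commits instead of 1,234. Using each_slice also removes the manual counter and the trailing commit.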
So, now I understand that I cannot take advantage of PostgreSQL's COPY command, since it can't copy into multiple tables. Therefore, I switched to the gem activerecord-import. Compared with the method that Philip Hallstrom mentioned above, activerecord-import gives a faster result: 1m20s vs 1m54s to import just over 8,000 records.
This is my code after installing the gem activerecord-import. Hopefully it can help other people.
locations = []
columns = [:code, :name, :ecc]
CSV.foreach(Rails.root.join('db/seeds/locations.csv'), headers: true) do |row|
val = Location.find_or_initialize_by(code: row[0])
val.name = row[1]
val.ecc = row[2] || 'MISSING'
val.created_by = User.find_by(name: 'anh')
val.updated_by = User.find_by(name: 'anh')
locations << val
end
Location.import columns, locations, validate: false
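As a possible further step, here is a sketch that looks the user up once and builds plain arrays of values instead of model objects, which avoids one query per row. The sample CSV text, the created_by_id / updated_by_id column names and the user id 42 are assumptions for illustration. The final bulk insert is left out because it needs a database. Skipping find_or_initialize_by also means this sketch does not handle rows that already exist.

```ruby
require 'csv'

# Hypothetical sample standing in for db/seeds/locations.csv.
SAMPLE_CSV = <<~TEXT
  code,name,ecc
  A1,Alpha,E1
  B2,Beta,
  C3,Gamma,E3
TEXT

# Assumed column names; the *_id columns are guesses at the foreign keys.
COLUMNS = %i[code name ecc created_by_id updated_by_id].freeze

# Builds one array of values per CSV row, in COLUMNS order.
# user_id is looked up once by the caller instead of once per row.
def build_rows(csv_text, user_id)
  CSV.parse(csv_text, headers: true).map do |row|
    [row['code'], row['name'], row['ecc'] || 'MISSING', user_id, user_id]
  end
end

rows = build_rows(SAMPLE_CSV, 42)
rows.each { |r| puts r.inspect }
```

activerecord-import also accepts a list of columns plus an array of value arrays, so rows like these could be handed to it directly.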
This is code I am using in my project.
Please suggest some optimizations. (I have refactored this code a lot, but I can't think of any further way to improve it.)
def convert_uuid_to_emails(user_payload)
return unless (user_payload[:target] == 'ticket' or user_payload[:target] == 'change')
action_data = user_payload[:actions]
action_data.each do |data|
is_add_project = data[:name] == 'add_fr_project'
is_task = data[:name] == 'add_fr_task'
next unless (is_add_project or is_task)
has_reporter_uuid = is_task && Va::Action::USER_TYPES.exclude?(data[:reporter_uuid])
user_uuids = data[:user_uuids] || []
user_uuids << data[:owner_uuid] if Va::Action::USER_TYPES.exclude?(data[:owner_uuid])
user_uuids << data[:reporter_uuid] if has_reporter_uuid
users_data = current_account.authorizations.includes(:user).where(uid: user_uuids).each_with_object({}) { |a, o| o[a.uid] = {uuid: a.uid, user_id: a.user.id, user_name: a.user.name} }
if Va::Action::USER_TYPES.include? data[:owner_uuid]
data['owner_details'] = {}
else
data['owner_details'] = users_data[data[:owner_uuid]]
users_data.delete(data[:owner_uuid])
end
data['reporter_details'] = has_reporter_uuid ? users_data[data[:reporter_uuid]] : {}
data['user_details'] = users_data.values
end
end
Note that Rubocop is complaining that your code is too hard to understand, not that it won't work correctly. The method is called convert_uuid_to_emails, but it doesn't just do that:
validates that the payload is one of two types
filters the items in the payload by two other types
determines the presence of various user roles in the input
shoves all the found user UUIDs into an array
converts the UUIDs into users by looking them up
finds them again in the array to enrich the various types of user details in the payload
This comes down to a big violation of the SRP (single responsibility principle), not to mention that it is a method that might surprise the caller with its unexpected list of side effects.
Obviously, all of these steps still need to be done, just not all in the same method.
Consider breaking these steps out into separate methods that you can compose into an enrich_payload_data method that works at a higher level of abstraction, keeping the details of how each part works local to each method. I would probably create a method that takes a UUID and converts it to a user, which can be called each time you need to look up a UUID to get the user details, as this doesn't appear to be role-specific.
The booleans is_task, is_add_project, and has_reporter_uuid are just intermediate variables that clutter up the code, and you probably won't need them if you break it down into smaller methods.
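A rough sketch of what that decomposition could look like, in plain Ruby so it runs on its own. Everything here is an assumption for illustration: PLACEHOLDER_USER_TYPES stands in for Va::Action::USER_TYPES, and the directory hash stands in for the authorizations query. Unlike the original, it treats a nil UUID as "nothing to look up" and falls back to {} when the owner is not found.

```ruby
# Hypothetical stand-in for Va::Action::USER_TYPES.
PLACEHOLDER_USER_TYPES = %w[requester agent].freeze
ENRICHABLE_ACTIONS = %w[add_fr_project add_fr_task].freeze

# True when the value is an actual UUID rather than nil or a placeholder type.
def real_uuid?(uuid)
  !uuid.nil? && !PLACEHOLDER_USER_TYPES.include?(uuid)
end

def task?(action)
  action[:name] == 'add_fr_task'
end

# Collects every UUID of an action that needs to be resolved to a user.
def uuids_to_look_up(action)
  uuids = Array(action[:user_uuids]).dup
  uuids << action[:owner_uuid] if real_uuid?(action[:owner_uuid])
  uuids << action[:reporter_uuid] if task?(action) && real_uuid?(action[:reporter_uuid])
  uuids.uniq
end

# Adds owner, reporter and user details to a single action.
def enrich_action(action, directory)
  users = directory.slice(*uuids_to_look_up(action))
  action['owner_details'] = users.delete(action[:owner_uuid]) || {}
  reporter = task?(action) ? users[action[:reporter_uuid]] : nil
  action['reporter_details'] = reporter || {}
  action['user_details'] = users.values
  action
end

# Top-level method: checks the target, then enriches the matching actions.
def enrich_payload_data(payload, directory)
  return payload unless %w[ticket change].include?(payload[:target])

  payload[:actions].each do |action|
    enrich_action(action, directory) if ENRICHABLE_ACTIONS.include?(action[:name])
  end
  payload
end

directory = {
  'u1' => { uuid: 'u1', user_id: 1, user_name: 'Ann' },
  'u2' => { uuid: 'u2', user_id: 2, user_name: 'Bob' },
  'u3' => { uuid: 'u3', user_id: 3, user_name: 'Cy' }
}
payload = {
  target: 'ticket',
  actions: [
    { name: 'add_fr_task', owner_uuid: 'u1', reporter_uuid: 'u2', user_uuids: ['u3'] },
    { name: 'something_else' }
  ]
}
enrich_payload_data(payload, directory)
puts payload[:actions].first['owner_details'].inspect
```

Each method can now be named, read and tested separately, and the intermediate booleans are gone.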
I have a CSV file which looks like this:
1ttAAAttAnaattFrench PolynesiattPFttAustralia and Oceaniatt-17.352606tt-145.509956
2ttAAEttAnnabattAlgeriattDZttAfricatt36.822225tt7.809167
3ttAAFttApalachicolattUnited StatesttUSttNorth Americatt29.7276066tt-85.0274416
4ttAAGtt\NttBrazilttBRttSouth Americatt\Ntt\N
I use this gem to fetch data: https://github.com/tilo/smarter_csv
This is the code I use to show data in terminal console:
filename = 'db/csv/airports_codes.csv'
options = {
:col_sep => 'tt',
}
records = SmarterCSV.process(filename, options)
puts records
I put this code in the seeds.rb file because I will modify it later to seed my database with data. The last line is only there so I can see what the data looks like. So I run rake db:seed
The output is obviously huge because there are around 5k lines. The first problem is that I can't see all of the data in my terminal. When I scroll to the top, this is the first item (note that the ID is 4674, which means only the last ~250 items are displayed):
{:"1"=>4674, :aaa=>"YPJ", :anaa=>"Aupaluk", :french_polynesia=>"Canada", :pf=>"CA", :australia_and_oceania=>"North America", :"_17.352606"=>59.2967, :"_145.509956"=>-69.5997}
How do I see the other items?
The second problem is that the key names are really weird. How do I rename them, or even better, how do I use arrays instead of hashes?
If you set the option
:headers_in_file => false
in options, that should sort the problem out.
i.e.
filename = 'db/csv/airports_codes.csv'
options = {
:col_sep => 'tt',
:headers_in_file => false
}
records = SmarterCSV.process(filename, options)
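If plain arrays or custom key names are the goal, the same data can also be parsed with nothing but core Ruby. This is only a sketch: the column names are guesses based on the sample rows, and it assumes the separator 'tt' never occurs inside a field value. It also turns the \N markers into nil.

```ruby
# Two sample lines copied from the question; 'tt' is the column separator.
SAMPLE = <<~'TEXT'
  1ttAAAttAnaattFrench PolynesiattPFttAustralia and Oceaniatt-17.352606tt-145.509956
  4ttAAGtt\NttBrazilttBRttSouth Americatt\Ntt\N
TEXT

# Hypothetical column names, guessed from the sample rows.
COLUMN_NAMES = %i[id code city country country_code continent latitude longitude].freeze

# Splits each line into an array of values, mapping the \N marker to nil.
def parse_rows(text)
  text.each_line(chomp: true).map do |line|
    line.split('tt', -1).map { |value| value == '\N' ? nil : value }
  end
end

# Pairs every row array with the column names to get readable hashes.
def to_records(rows)
  rows.map { |values| COLUMN_NAMES.zip(values).to_h }
end

rows = parse_rows(SAMPLE)
records = to_records(rows)
puts rows.first.inspect
puts records.last.inspect
```

To view a large result, writing it to a file or piping the rake output through a pager is usually easier than scrolling the terminal.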
How can I apply a lock on a particular field so that the same number is not generated twice?
I have created an algorithm that builds a string from the year + zero padding + an integer number,
for example: "20150001", "20150002", "20150003", etc.
The problem is that when multiple users request a number at the same time, the same number is generated.
This is the function I call:
def get_algo_number(model_name,prefix)
year = get_year
if model_name.count > 0
last_number = model_name.last.number
if last_number[2..5].to_i > year.to_i
return create_number(year,prefix)
else
# if the latest generated number already exists, generate a new number
return last_number.next
end
else
return create_number(year,prefix)
end
end
Please help if you have any solution for applying such a lock.
Thanks
Yes, I resolved this problem by using multi-threading.
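For readers wondering what serializing the number generation can look like, here is a minimal sketch using Ruby's Mutex. NumberGenerator is a hypothetical class, not the poster's actual solution. It only protects threads inside a single process; with several app servers or processes, the uniqueness would have to be enforced in the database instead (a lock or a unique constraint).

```ruby
# Hypothetical generator: hands out "YYYYNNNN" strings, one caller at a time.
class NumberGenerator
  def initialize(year, last_sequence = 0)
    @year = year
    @sequence = last_sequence
    @lock = Mutex.new
  end

  # Only one thread at a time can run the read-increment-format step,
  # so two callers can never receive the same number.
  def next_number
    @lock.synchronize do
      @sequence += 1
      format('%d%04d', @year, @sequence)
    end
  end
end

generator = NumberGenerator.new(2015)

# Ten threads each request one hundred numbers concurrently.
threads = Array.new(10) do
  Thread.new { Array.new(100) { generator.next_number } }
end
numbers = threads.flat_map(&:value)

puts "generated: #{numbers.size}, unique: #{numbers.uniq.size}"
```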
This code just displays the values inside the array model.request_reports.
To get the most recent report, I have to loop through the array and compare the current report.updated_at with the last saved report.updated_at value. One thing to find out is what class the updated_at field is and how to compare two values against each other. The class is ActiveSupport::TimeWithZone.
I need to keep track of the array index of the report that has the most recent updated_at as I loop, so that I can access it after the loop.
The problem is, I don't know how to do this:
msg = ""
reports_arr = model.request_reports
reports_arr.each do |report|
updated_at = report.updated_at
if updated_at
msg = msg + "#{updated_at} --- "
msg = msg + "#{updated_at.class}---"
end
end
msg
To add to @meagar's comment: you should be using the DB to sort tables.
That said, we need to know which DB you are using, as the exact command differs for each.
Mongo w/ Mongoid would be Model.order_by(:updated_at => 'desc').first
My loop had to go through the array and find the greatest date value because the system I'm using automatically sorts the reports array by the field "due_at", which does not put the most recently updated report first. The code below works for me.
msg = ""
reports_arr = model.request_reports
last_modified_report = model.last_modified_report
recent = nil
recent_report = nil
reports_arr.each_with_index do |report,index|
updated_at = report.updated_at
if index == 0
recent = updated_at
recent_report = report
end
if updated_at > recent
recent = updated_at
recent_report = report
end
last_modified_report = recent_report
end
msg = msg + "#{recent}---"
msg = msg + "#{recent_report}---"
msg = msg + "#{last_modified_report}"
model.last_modified_report = last_modified_report
model.save(validate: false)
msg
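When the reports are already loaded into an array, the manual loop can be condensed with Enumerable#max_by. A small sketch, using a hypothetical Report struct in place of the real model:

```ruby
# Hypothetical stand-in for the real report model.
Report = Struct.new(:name, :updated_at)

# Returns the report with the latest updated_at, ignoring reports without one.
# Returns nil when no report has a timestamp.
def most_recent_report(reports)
  reports.reject { |report| report.updated_at.nil? }.max_by(&:updated_at)
end

reports = [
  Report.new('a', Time.utc(2015, 1, 1)),
  Report.new('b', Time.utc(2015, 3, 1)),
  Report.new('c', nil),
  Report.new('d', Time.utc(2015, 2, 1))
]

puts most_recent_report(reports).name
```

This also avoids comparing against nil, which would raise an error in the loop version if a report had no updated_at.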
The OP's answer is only good if you absolutely cannot query the database for the info you want directly. I assume you only want the index so you can find the most recent one?
Even if automatic sorting is on one column, your query for the data can have it sorted on a different column.
model.request_reports.order_by(:updated_at => 'desc').first
If you have a default scope that's messing with your query, you can ask for an unscoped list, although I doubt a default ordering would cause any trouble.
model.unscoped.order_by(:updated_at => 'desc').first
You can string together queries that are already written: that can be useful even if request_reports is a query or scope you have somewhere.
It will be way less expensive than getting everything, and looping through it - you are always better off finding a way to get just the info you need in a db query if you can.
Nokogiri works fine for me in the console, but if I put it anywhere... Model, View, or Controller, it times out.
I'd like to use it 1 of 2 ways...
Controller
def show
@design = Design.find(params[:id])
doc = Nokogiri::HTML(open(design_url(@design)))
images = doc.css('.well img') ? doc.css('.well img').map{ |i| i['src'] } : []
end
or...
Model
def first_image
doc = Nokogiri::HTML(open("http://localhost:3000/blog/#{self.id}"))
image = doc.css('.well img')[0] ? doc.css('.well img')[0]['src'] : nil
self.update_attribute(:photo_url, image)
end
Both result in a timeout, though they work perfectly in the console.
When you run your Nokogiri code from the console, you're referencing your development server at localhost:3000. Thus, there are two instances running: one making the call (your console) and one answering the call (your server)
When you run it from within your app, you are referencing the app itself, which is causing an infinite loop since there is no available resource to respond to your call (that resource is the one making the call!). So you would need to be running multiple instances with something like Unicorn (or simply another localhost instance at a different port), and you would need at least one of those instances to be free to answer the Nokogiri request.
If you plan to run this in production, just know that this setup will require an available resource to answer the Nokogiri request, so you're essentially tying up 2 instances with each call. So if you have 4 instances and all 4 happen to make the call at the same time, your whole application is screwed. You'll probably experience pretty severe degradation with only 1 or 2 calls at a time as well...
I'm not sure what the default timeout value is,
but you can specify your own timeout values like below.
require 'net/http'
# the app in the question runs on localhost:3000
http = Net::HTTP.new('localhost', 3000)
http.open_timeout = 100
http.read_timeout = 100
Nokogiri.parse(http.get("/blog/#{self.id}").body)
Once you control the timeout value, you can narrow down what the actual problem is.
So, with tyler's advice I dug into what I was doing a bit more. Because of the disconnect that ckeditor has with the images, due to carrierwave and S3, I can't get any info direct from the uploader (at least it seems that way to me).
Instead, I'm sticking with nokogiri, and it's working wonderfully. I realized what I was actually doing with the open() command, and it was completely unnecessary. Nokogiri parses HTML. I can give it HTML in the form of @design.content! Duh, on my part.
So, this is how I'm scraping my own site, to get the images associated with a blog entry:
designs_controller.rb
def create
params[:design][:photo_url] = Nokogiri::HTML(params[:design][:content]).css('img').map{ |i| i['src']}[0]
@design = Design.new(params[:design])
if @design.save
flash[:success] = "Design created"
redirect_to designs_url
else
render 'designs/new'
end
end
def show
@design = Design.find(params[:id])
@categories = @design.categories
@tags = @categories.map {|c| c.name}
@related = Design.joins(:categories).where('categories.name' => @tags).reject {|d| d.id == @design.id}.uniq
set_meta_tags og: {
title: @design.name,
type: 'article',
url: design_url(@design),
image: Nokogiri::HTML(@design.content).css('img').map{ |i| i['src']},
article: {
published_time: @design.published_at.to_datetime,
modified_time: @design.updated_at.to_datetime,
author: 'Alphabetic Design',
section: 'Designs',
tag: @tags
}
}
end
The Update action has the same code for Nokogiri as the Create action.
Seems kind of obvious now that I'm looking at it, lol. I dwelled on this for longer than I'd like to admit...