I have a CSV importing system in my app (used locally only) which parses the CSV file line by line and adds the data to the database table. This is based on a tutorial here.
require 'csv'

def csv_import
  @parsed_file = CSV::Reader.parse(params[:dump][:file])
  n = 0
  @parsed_file.each_with_index do |row, i|
    next if i == 0 # ignore the header row
    course = Course.new
    course.title = row[0]
    course.unit_code = row[1]
    course.course_type = row[2]
    course.value = row[3]
    course.pass_mark = row[4]
    if course.save
      n += 1
      GC.start if n % 50 == 0 # keep memory in check on large files
    end
  end
  flash[:message] = "CSV Import Successful, #{n} new courses added to the database."
  redirect_to(courses_url)
end
This is all in the courses controller and works fine. There is a HABTM relationship: courses have and belong to many years, and years have and belong to many courses. In the CSV file (effectively in row[5] to row[8]) are the year_ids. Is there a way that I can add these within the method above? I am confused as to how to loop over the four items and add them to the courses_years join table.
Thank you
Jack
You can do this by adding a simple loop after your "normal" data is added to the model, and using the << method to append to the years association.
...
course.value = row[3]
course.pass_mark = row[4]
5.upto(8) do |i|
  one_year = Year.find_by_id(row[i]) # find_by_id returns nil (instead of raising) when the id is unknown
  course.years << one_year if one_year
end
if course.save
  n += 1
...
You can add more checks in the loop if you want to make sure that the values are valid, and/or change the find to locate your year in another way. Another approach, when the related data "trails off the end" like this, is to keep adding until there is nothing left to add, and also to create the years themselves if they don't exist yet:
...
course.value = row[3]
course.pass_mark = row[4]
row[5..-1].each do |year_id|
  one_year = Year.find_or_create_by_id(year_id)
  course.years << one_year
end
if course.save
  n += 1
...
There are a lot of different ways to do this, and which one is right really depends on your actual data, but this is the basic method.
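For instance, here is a sketch of the same trailing-columns loop with the extra validity checks mentioned above (assuming blank trailing cells are possible in your CSV, and using the old-style find_by_id finder, which returns nil for unknown ids):

row[5..-1].each do |year_id|
  next if year_id.blank?              # skip empty trailing cells
  one_year = Year.find_by_id(year_id) # nil instead of an exception for unknown ids
  course.years << one_year if one_year
end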
Have you tried to put either one of these before you save the course:
course.years.push(row[5])
course.years.push(row[6])
course.years.push(row[7])
course.years.push(row[8])
OR
course.years = [ row[5], row[6], row[7], row[8] ]
Place it before you save the course. It will fill the join table courses_years.
EDIT
The error that you get seems to be because we are trying to push ids instead of objects; we should do this instead:
.....
year_array = Year.find(row[5], row[6], row[7], row[8])
course.years << year_array
.....
After we get the year objects, we put them inside the association. You can save the course object after that.
I have a model named Vendor. I have three different models associated with it:
Testimonial
Service
Works
I have to look up each of these tables and list the vendors (limit 5) which have the word "example" the most times in a column of one of these models. How do I do this in Rails?
I have something like this:
Vendor.joins(:testimonials).where("testimonials.content LIKE ?", "%example%")
I need to find the vendors who have the maximum number of occurrences of the word "example".
I hope I got you right now:
a = []
vendors.each do |v|
  # count "example" across each vendor's testimonial contents
  c = v.testimonials.sum { |t| t.content.scan(/example/).count }
  a << { vendor: v, counts: c } if c > 0
end
In Ruby you can count the occurrences of a substring in a string like this, for example:
s = "this is a string with an example, example, example"
s.scan(/example/).count # => 3
This is how I ended up doing it. Not sure if my question was asked correctly. I made this with the help of a previous answer by @user3383458:
vendor = Hash.new
Vendor.all.each do |v|
  testi_word_count   = 0
  service_word_count = 0
  title_word_count   = 0
  v.testimonials.each do |testi|
    testi_word_count += testi.content.scan(/#{word_to_search}/).count
    Rails.logger.info "testi_word_count"
    Rails.logger.info testi_word_count
  end
  v.services.each do |service|
    service_word_count += service.name.scan(/#{word_to_search}/).count
    Rails.logger.info "service_word_count"
    Rails.logger.info service_word_count
  end
  v.works.each do |work|
    title_word_count += work.title.scan(/#{word_to_search}/).count
    Rails.logger.info "title_word_count"
    Rails.logger.info title_word_count
  end
  vendor[v] = testi_word_count + service_word_count + title_word_count
end
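To actually get the five vendors with the most occurrences (the "limit 5" part of the question), the hash built above can then be sorted by count; a small sketch, assuming the vendor hash from the code above:

top_five = vendor.sort_by { |_v, count| -count }.first(5)
top_five.each do |v, count|
  Rails.logger.info "#{v.id}: #{count}"
end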
I develop a Heroku Rails application on the Cedar stack, and this is the bottleneck:
def self.to_csvAlt(options = {})
  CSV.generate(options) do |csv|
    column_headers = ["user_id", "session_id", "survey_id"]
    pages = PageEvent.order(:page).select(:page).map(&:page).uniq
    page_attributes = ["a", "b", "c", "d", "e"]
    pages.each do |p|
      page_attributes.each do |pa|
        column_headers << p + "_" + pa
      end
    end
    csv << column_headers
    session_ids = PageEvent.order(:session_id).select(:session_id).map(&:session_id).uniq
    session_ids.each do |si|
      session_user = PageEvent.find(:first, :conditions => ["session_id = ? AND page != ?", si, 'none'])
      if session_user.nil?
        row = [si, nil, nil, nil]
      else
        row = [session_user.username, si, session_user.survey_name]
      end
      pages.each do |p|
        a = 0
        b = 0
        c = 0
        d = 0
        e = 0
        allpages = PageEvent.where(:page => p, :session_id => si)
        allpages.each do |ap|
          a += ap.a
          b += ap.b
          c += ap.c
          d += ap.d
          e += ap.e
        end
        index = pages.index p
        end_index = (index + 1) * 5 + 2
        if !p.nil?
          row[end_index]     = a
          row[end_index - 1] = b
          row[end_index - 2] = c
          row[end_index - 3] = d
          row[end_index - 4] = e
        else
          row[end_index]     = nil
          row[end_index - 1] = nil
          row[end_index - 2] = nil
          row[end_index - 3] = nil
          row[end_index - 4] = nil
        end
      end
      csv << row
    end
  end
end
As you can see, it generates a CSV file from a table that contains data on each individual page taken from a group of surveys. The problem is that there are ~50,000 individual pages in the table, and the Heroku app keeps giving me R14 errors (out of memory, 512 MB) and eventually dies when the dyno goes to sleep after an hour.
That being said, I really don't care how long it takes to run; I just need it to complete. I am waiting on approval to add a worker dyno to run the CSV generation, which I know will help, but in the meantime I would still like to optimize this code. There is potential for over 100,000 pages to be processed at a time, and I realize this is incredibly memory heavy, so I really need to cut back its memory usage as much as possible. Thank you for your time.
You can split it up into batches so that the work is completed in sensible chunks.
Try something like this:
def self.to_csvAlt(options = {})
  # ...
  PageEvent.find_each(:batch_size => 5000) do |page_event|
    # ...
  end
Using find_each with a batch_size (note that it has to be called on the model or an ActiveRecord relation, not on the plain array that map/uniq produces), you won't do one huge lookup for your loop. Instead it'll fetch 5000 records, run your loop, fetch another 5000, loop again, and so on, until no more records are returned. One caveat: find_each ignores any order clause, because it batches by primary key.
The other key thing to note here is that rather than Rails trying to instantiate all of the objects returned from the database at the same time, it will only instantiate those returned in the current batch. This can save a huge memory overhead if you have a giant dataset.
UPDATE:
Using #map to restrict your results to a single attribute of your model is highly inefficient. You should instead use the pluck Active record method to just pull back the data you want from the DB directly rather than manipulating the results with Ruby, like this:
# Instead of this:
pages = PageEvent.order(:page).select(:page).map(&:page).uniq
# Use this:
pages = PageEvent.order(:page).pluck(:page).uniq
I also personally prefer to use .distinct rather than the alias .uniq, as I feel it sits more in line with the DB query rather than confusing things with what looks more like an array function. Note that it has to be called on the relation, before pluck, since pluck returns a plain array:
pages = PageEvent.order(:page).distinct.pluck(:page)
Use
CSV.open("path/to/file.csv", "wb")
instead of CSV.generate. This will stream the CSV into the file, whereas generate builds one huge string that will end up exhausting memory if it gets too large.
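A minimal sketch of the streaming approach (the file path, headers, and row contents here are placeholders, and the find_each batching from the earlier answer is reused):

CSV.open("path/to/file.csv", "wb") do |csv|
  csv << ["user_id", "session_id", "survey_id"]
  PageEvent.find_each(:batch_size => 5000) do |page_event|
    csv << [page_event.session_id, page_event.page] # build the real row here
  end
end

Each row is written to disk as soon as it is generated, so memory use stays flat no matter how large the table is.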
Slowly getting there with what I am trying to achieve. I am grabbing data via screen scraping and want to save the data to my model; I have two columns, home_team and away_team. So far I grab the data:
require 'nokogiri'
require 'open-uri' # needed for open(FIXTURE_URL)

FIXTURE_URL = "http://www.bbc.co.uk/sport/football/premier-league/fixtures"

def get_fixtures # get all home and away teams
  doc = Nokogiri::HTML(open(FIXTURE_URL))
  home_team = doc.css(".team-home.teams").map { |h| h.text.strip }
  away_team = doc.css(".team-away.teams").map { |a| a.text.strip }
  #team_clean = Hash[:home_team => home_team, :away_team => away_team]
  #team_clean = Hash[:team_clean => [Hash[:home_team => home_team, :away_team => away_team]]]
end
I have hashed out the two ways of getting the data into a hash: one is a plain hash and the other is a hash within a hash. I am not sure which one I need (if any?).
So if I want to save the data received for my home_team, I run a rake task to do this:
def update_fixtures # rake task method
  Fixture.destroy_all
  get_fixtures.each { |home| Fixture.create(:home_team => home) }
end
What I want to achieve is to be able to save home_team and away_team at the same time. Do I need to access the data within the hash, and if so, how? Bit lost here, but this is the first time I am attempting this.
Any help appreciated.
Try this,
FIXTURE_URL = "http://www.bbc.co.uk/sport/football/premier-league/fixtures"

def get_fixtures # get all home and away teams
  doc = Nokogiri::HTML(open(FIXTURE_URL))
  matches = doc.css('tr.preview')
  matches.each do |match|
    home_team = match.css('.team-home').text.strip
    away_team = match.css('.team-away').text.strip
    Fixture.create!(home_team: home_team, away_team: away_team)
  end
end
This will loop through the matches and create a new Fixture with away and home teams for each match.
Edit:
Added .text.strip
Edit 2:
This should get you the dates too:
FIXTURE_URL = "http://www.bbc.co.uk/sport/football/premier-league/fixtures"

def get_fixtures # get all home and away teams, plus dates
  doc = Nokogiri::HTML(open(FIXTURE_URL))
  doc.css('#fixtures-data h2').each do |h2_tag|
    date = Date.parse(h2_tag.text.strip)
    matches = h2_tag.xpath('following-sibling::*[1]').css('tr.preview')
    matches.each do |match|
      home_team = match.css('.team-home').text.strip
      away_team = match.css('.team-away').text.strip
      Fixture.create!(home_team: home_team, away_team: away_team, date: date)
    end
  end
end
It's a bit more complicated than the previous code because it has to use some XPath to find the next HTML element after each h2 tag containing a date.
It loops through all the h2 tags inside the div#fixtures-data HTML, then grabs the table tag directly below/after each h2.
Prior to Rails 3.1, we could override the self.columns method of ActiveRecord::Base.
But that doesn't seem to work now.
Now it seems that if I remove a column from a table, I am forced to restart the Rails server. If I don't, I keep getting errors when INSERTs to the table happen: Rails still thinks the old column exists, even though it's no longer in the database.
Active Record does not support this out of the box, because it queries the database to get the columns of a model (unlike Merb's ORM, DataMapper).
Nonetheless, you can patch this feature into Rails (assuming, for instance, you want to ignore all columns whose names start with the string "deprecated"):
module ActiveRecord
  module ConnectionAdapters
    class SchemaCache
      def initialize(conn)
        @connection = conn
        @tables = {}

        @columns = Hash.new do |h, table_name|
          columns = conn.columns(table_name, "#{table_name} Columns").reject { |c| c.name.start_with?("deprecated") }
          h[table_name] = columns
        end

        @columns_hash = Hash.new do |h, table_name|
          h[table_name] = Hash[columns[table_name].map { |col|
            [col.name, col]
          }]
        end

        @primary_keys = Hash.new do |h, table_name|
          h[table_name] = table_exists?(table_name) ? conn.primary_key(table_name) : nil
        end
      end
    end
  end
end
You can clear the ActiveRecord schema cache:
ActiveRecord::Base.connection.schema_cache.clear_table_cache!(:table_name)
Then it'll be reloaded the next time you reference a model that uses that table.
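For example, after dropping a column in a migration (a sketch; the Course model and courses table here are assumed placeholders):

ActiveRecord::Base.connection.schema_cache.clear_table_cache!(:courses)
Course.reset_column_information # make the model re-read its columns on next use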
As you can see in the current code below, I am finding duplicates based on the attribute recordable_id. What I need to do is find duplicates based on four matching attributes: user_id, recordable_type, hero_type, and recordable_id. How must I modify the code?
heroes = User.heroes
for hero in heroes
  hero_statuses = hero.hero_statuses
  seen = []
  hero_statuses.sort! { |a, b| a.created_at <=> b.created_at } # sort by created_at
  hero_statuses.each do |hero_status|
    if seen.map(&:recordable_id).include?(hero_status.recordable_id) # has this id been seen already?
      hero_status.revoke
    else
      seen << hero_status # if not, add it to the seen array
    end
  end
end
Try this:
HeroStatus.all(:group  => "user_id, recordable_type, hero_type, recordable_id",
               :having => "count(*) > 1").each do |status|
  status.revoke
end
Edit 2
To revoke all but the earliest of each set of duplicate entries, do the following:
HeroStatus.all(:joins => "
  INNER JOIN (
    SELECT user_id, recordable_type, hero_type,
           recordable_id, MIN(created_at) AS created_at
    FROM   hero_statuses
    GROUP  BY user_id, recordable_type, hero_type, recordable_id
    HAVING COUNT(*) > 1
  ) AS A ON A.user_id         = hero_statuses.user_id AND
            A.recordable_type = hero_statuses.recordable_type AND
            A.hero_type       = hero_statuses.hero_type AND
            A.recordable_id   = hero_statuses.recordable_id AND
            A.created_at      < hero_statuses.created_at
").each do |status|
  status.revoke
end
Using straight Ruby (not the SQL server):
heroes = User.heroes
for hero in heroes
  hero_statuses = hero.hero_statuses
  seen = {}
  hero_statuses.sort_by!(&:created_at)
  hero_statuses.each do |status|
    key = [status.user_id, status.recordable_type, status.hero_type, status.recordable_id]
    if seen.has_key?(key)
      status.revoke
    else
      seen[key] = status # if not, remember it in the seen hash
    end
  end
  remaining = seen.values # the statuses that were kept
end
For lookups, always use a Hash (or a Set, but here I thought it would be nice to keep hold of the statuses that remain).
Note: I used sort_by!, but that's new in 1.9.2, so use sort_by instead (or require "backports").