How to import data into Rails? - ruby-on-rails

I have a Rails 3 application with a User class, and a tab-delimited file of users that I want to import.
How do I get access to the Active Record model outside the Rails console, so that I can write a script like this?
require "???active-record???"
File.open("users.txt", "r").each do |line|
name, age, profession = line.strip.split("\t")
u = User.new(:name => name, :age => age, :profession => profession)
u.save
end
Do I use the "ar-extensions" gem, or is there another way? (I don't particularly care about speed right now, I just want something simple.)

You can write a rake task to do so.
Add this to a my_rakes.rake file in the your_app/lib/tasks folder:
desc "Import users."
task :import_users => :environment do
File.open("users.txt", "r").each do |line|
name, age, profession = line.strip.split("\t")
u = User.new(:name => name, :age => age, :profession => profession)
u.save
end
end
And then call $ rake import_users from the root folder of your app in Terminal.
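If the file later gains quoted fields, Ruby's standard CSV library can read tab-delimited data by setting col_sep. Here is a minimal sketch, assuming the same three columns as in the question. The parse_users helper and the sample data are hypothetical; inside the rake task each hash would be passed to User.new rather than printed.

```ruby
require 'csv'

# Hypothetical helper: turns tab-delimited text into an array of attribute hashes.
# The column order (name, age, profession) is assumed from the question.
def parse_users(text)
  CSV.parse(text, col_sep: "\t").map do |name, age, profession|
    { :name => name, :age => age.to_i, :profession => profession }
  end
end

sample = "Alice\t30\tEngineer\nBob\t25\tBaker\n"
parse_users(sample).each { |attrs| puts attrs.inspect }
```

Keeping the parsing in its own method makes it easy to try out in irb before any records are written.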

Use the activerecord-import gem for bulk importing.
Install via your Gemfile:
gem 'activerecord-import'
Collect your users and import:
desc "Import users."
task :import_users => :environment do
users = File.open("users.txt", "r").map do |line|
name, age, profession = line.strip.split("\t")
User.new(:name => name, :age => age, :profession => profession)
end
User.import users
end
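For very large files it may help to import in batches instead of building one huge array. The following is only a sketch of the batching idea: FakeUser is a hypothetical stand-in so the code runs without a Rails app, and import_in_batches is an illustrative helper name. In the real task you would build User objects and call User.import on each batch.

```ruby
# Hypothetical stand-in for an ActiveRecord model, used only so the sketch
# runs without a Rails app. It records the size of every batch it receives.
class FakeUser
  @batches = []
  class << self
    attr_reader :batches

    def import(records)
      @batches << records.size
    end
  end
end

# Hands the lines to the model in fixed-size batches, so only one batch
# of parsed records exists at a time.
def import_in_batches(lines, model, batch_size)
  lines.each_slice(batch_size) do |batch|
    records = batch.map { |line| line.strip.split("\t") }
    model.import(records)
  end
end

lines = Array.new(2500) { |i| "user#{i}\t30\tclerk\n" }
import_in_batches(lines, FakeUser, 1000)
puts FakeUser.batches.inspect # prints [1000, 1000, 500]
```

The batch size of 1000 is an arbitrary choice for the sketch; a suitable value depends on the database and the row size.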

Related

Relationships created by a rake task are not persisted through the Rails server

I'm working on my first project using Neo4j. I'm parsing Wikipedia's page and pagelinks dumps to create a graph where the nodes are pages and the edges are links.
I've defined some rake tasks that download the dumps, parse the data, and save it in a Neo4j database. At the end of the rake task I print the number of pages and links created, and some of the pages with the most links. Here is the output of the rake task for the zawiki.
$ rake wiki[zawiki]
[ omitted ]
...
:: Done parsing zawiki
:: 1984 pages
:: 2144 links
:: The pages with the most links are:
9625.0 - Emijrp/List_of_Wikipedians_by_number_of_edits_(bots_included): 40
1363.0 - Gvangjsih_Bouxcuengh_Swcigih: 30
9112.0 - Fuzsuih: 27
1367.0 - Cungzcoj: 26
9279.0 - Vangz_Yenfanh: 19
It looks like pages and links are being created, but when I start a Rails console or the server, the links aren't found.
$ rails c
jruby-1.7.5 :013 > Pages.all.count
=> 1984
jruby-1.7.5 :003 > Pages.all.reduce(0) { |count, page| count + page.links.count}
=> 0
jruby-1.7.5 :012 > Pages.all.sort_by { |p| p.links.count }.reverse[0...5].map { |p| p.links.count }
=> [0, 0, 0, 0, 0]
Here is the rake task, and this is the project's GitHub page. Can anyone tell me why the links aren't saved?
DUMP_DIR = Rails.root.join('lib','assets')
desc "Download wiki dumps and parse them"
task :wiki, [:wiki] => 'wiki:all'
namespace :wiki do
task :all, [:wiki] => [:get, :parse] do |t, args|
# Print info about the newly created pages and links.
link_count = 0
Pages.all.each do |page|
link_count += page.links.count
end
indent "Done parsing #{args[:wiki]}"
indent "#{Pages.count} pages"
indent "#{link_count} links"
indent "The pages with the most links are:"
Pages.all.sort_by { |a| a.links.count }.reverse[0...5].each do |page|
puts "#{page.page_id} - #{page.title}: #{page.links.count}"
end
end
desc "Download wiki page and page links database dumps to /lib/assets"
task :get, :wiki do |t, args|
indent "Downloading dumps"
sh "#{Rails.root.join('lib', "get_wiki").to_s} #{args[:wiki]}"
indent "Done"
end
desc "Parse all dumps"
task :parse, [:wiki] => 'parse:all'
namespace :parse do
task :all, [:wiki] => [:pages, :pagelinks]
desc "Read wiki page dumps from lib/assets into the database"
task :pages, [:wiki] => :environment do |t, args|
parse_dumps('page', args[:wiki]) do |obj|
page = Pages.create_from_dump(obj)
end
indent "Created #{Pages.count} pages"
end
desc "Read wiki pagelink dumps from lib/assets into the database"
task :pagelinks, [:wiki] => :environment do |t, args|
errors = 0
parse_dumps('pagelinks', args[:wiki]) do |from_id, namespace, to_title|
from = Pages.find(:page_id => from_id)
to = Pages.find(:title => to_title)
if to.nil? || from.nil?
errors = errors.succ
else
from.links << to
from.save
end
end
end
end
end
def indent *args
print ":: "
puts args
end
def parse_dumps(dump, wiki_match, &block)
wiki_match ||= /\w+/
DUMP_DIR.entries.each do |file|
file, wiki = *(file.to_s.match(Regexp.new "(#{wiki_match})-#{dump}.sql"))
if file
indent "Parsing #{wiki} #{dump.pluralize} from #{file}"
each_value(DUMP_DIR.join(file), &block)
end
end
end
def each_value(filename)
f = File.open(filename)
num_read = 0
begin # read file until line starting with INSERT INTO
line = f.gets
end until line.match /^INSERT INTO/
begin
line = line.match(/\(.*\)[,;]/)[0] # ignore beginning of line until (...) object
begin
yield line[1..-3].split(',').map { |e| e.match(/^['"].*['"]$/) ? e[1..-2] : e.to_f }
num_read = num_read.succ
line = f.gets.chomp
end while(line[0] == '(') # until next insert block, or end of file
end while line.match /^INSERT INTO/ # Until line doesn't start with (...
f.close
end
app/models/pages.rb
class Pages < Neo4j::Rails::Model
include Neo4j::NodeMixin
has_n(:links).to(Pages)
property :page_id
property :namespace, :type => Fixnum
property :title, :type => String
property :restrictions, :type => String
property :counter, :type => Fixnum
property :is_redirect, :type => Fixnum
property :is_new, :type => Fixnum
property :random, :type => Float
property :touched, :type => String
property :latest, :type => Fixnum
property :length, :type => Fixnum
property :no_title_convert, :type => Fixnum
def self.create_from_dump(obj)
# TODO: I wonder if there is a way to combine these calls
page = {}
# order of this array is important, it corresponds to the data in obj
attrs = [:page_id, :namespace, :title, :restrictions, :counter, :is_redirect,
:is_new, :random, :touched, :latest, :length, :no_title_convert]
attrs.each_index { |i| page[attrs[i]] = obj[i] }
page = Pages.create(page)
return page
end
end
I must admit that I have no idea how Neo4j works.
Going by experience with other databases, though, I would guess that either a validation is failing or something is misconfigured in how you use the database. I can't advise on where to look for the latter, but if it is a validation problem, you can inspect Pages#errors or call Pages#save! and see what it raises.
One crazy idea that came to mind looking at this example: maybe the relation needs a back reference to be configured properly.
Maybe has_n(:links).to(Pages, :links) will help you. Or, if that doesn't work:
has_n(:links_left).to(Pages, :links_right)
has_n(:links_right).from(Pages, :links_left)
The more I look at this, the more I think the back reference to the same table is not configured properly and thus won't validate.

Rails Rake Task - How to Delete Records

I am trying to use a daily rake task to synchronize a users table in my app with a CSV file.
My import.rake task successfully imports records that aren't found in the table (find_or_create_by_username), but I don't know how to delete records from the table that are no longer found in the CSV file. What should I use instead of "find_or_create_by_username" to achieve this? Thanks in advance.
#lib/tasks/import.rake
require 'csv'
desc "Import employees from csv file"
task :import => [:environment] do
file = "db/testusers.csv"
usernames = [] # make an array to collect names
CSV.foreach(file, headers: true) do |row|
# Read the username once so it can also be collected below
username = row[0]
Employee.find_or_create_by_username({
:username => username,
:last_name => row[1],
:first_name => row[2],
:employee_number => row[3],
:phone => row[4],
:mail_station => row[5]
}
)
# Collect the usernames
usernames << username
end
# Delete the employees (make sure you fire them first)
Employee.where.not( username: usernames ).destroy_all
end
You can achieve this by doing like the following:
#lib/tasks/import.rake
require 'csv'
desc "Import employees from csv file"
task :import => [:environment] do
file = "db/users.csv"
employee_ids_to_keep = []
CSV.foreach(file, headers: true) do |row|
attrs = {
:username => row[0], :last_name => row[1], :first_name => row[2],
:employee_number => row[3], :phone => row[4],:mail_station => row[5]
}
# retrieves the Employee with username
employee = Employee.where(username: attrs[:username]).first
if employee.present? # updates the user's attributes if exists
employee.update_attributes(attrs)
else # creates the Employee if does not exist in the DB
employee = Employee.create!(attrs)
end
# keeps the ID of the employee to not destroy it
employee_ids_to_keep << employee.id
end
Employee.where('employees.id NOT IN (?)', employee_ids_to_keep).destroy_all
end
Get a list of all IDs in the database and store them in a set. Then as you do your importing, remove valid employees from the set. Once you're done, any IDs left in the set need to be removed from the database.
Something like this...
existing_ids = Employee.pluck(:id).to_set
CSV.foreach(file, headers: true) do |row|
employee = Employee.find_or_create_by.....
existing_ids.delete(employee.id)
end
Employee.destroy(*existing_ids.to_a) unless existing_ids.empty?
usernames = [] # make an array to collect names
CSV.foreach(file, headers: true) do |row|
username = row[0]
Employee.find_or_create_by_username({
:username => username,
:last_name => row[1],
:first_name => row[2],
:employee_number => row[3],
:phone => row[4],
:mail_station => row[5]
}
)
# Collect the usernames
usernames << username
end
# Delete the employees (make sure you fire them first)
Employee.where.not( username: usernames ).destroy_all
Note that where.not requires Rails 4 or later.
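The answers above share one idea: record which rows the CSV mentions, then delete everything else. Here is a rough sketch of that logic in plain Ruby, using a Hash as a hypothetical stand-in for the employees table. The sync_employees name and the sample data are made up for illustration; in the rake task the create, update and delete steps would be the ActiveRecord calls shown above.

```ruby
require 'set'

# `table` is a hypothetical in-memory stand-in for the employees table,
# mapping username => attributes. Returns the usernames that were removed.
def sync_employees(table, csv_rows)
  seen = Set.new
  csv_rows.each do |row|
    username = row[:username]
    table[username] = (table[username] || {}).merge(row) # create or update
    seen << username
  end
  # Anything not mentioned in the CSV is stale and gets deleted.
  stale = table.keys.reject { |username| seen.include?(username) }
  stale.each { |username| table.delete(username) }
  stale
end

table = {
  "alice" => { :username => "alice", :phone => "111" },
  "bob"   => { :username => "bob",   :phone => "222" }
}
csv_rows = [
  { :username => "alice", :phone => "999" },
  { :username => "carol", :phone => "333" }
]
removed = sync_employees(table, csv_rows)
puts "removed: #{removed.inspect}"
puts "remaining: #{table.keys.sort.inspect}"
```

A Set keeps the membership check cheap even when the CSV has many rows.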

Inputting scraped data into database

Heyo,
So I built a working scraper and added the file to my app. I am now trying to take the information in the scraper and place it in my database. I am attempting to use the find_or_create method but I keep getting the following error.
breads_scraper.rb:49:in `block in summary': uninitialized constant Scraper::Bread (NameError)
from /Users/Cameron/.rvm/gems/ruby-1.9.3-p392/gems/nokogiri-1.5.9/lib/nokogiri/xml/node_set.rb:239:in `block in each'
from /Users/Cameron/.rvm/gems/ruby-1.9.3-p392/gems/nokogiri-1.5.9/lib/nokogiri/xml/node_set.rb:238:in `upto'
from /Users/Cameron/.rvm/gems/ruby-1.9.3-p392/gems/nokogiri-1.5.9/lib/nokogiri/xml/node_set.rb:238:in `each'
from breads_scraper.rb:24:in `map'
from breads_scraper.rb:24:in `summary'
from breads_scraper.rb:57:in `<class:Scraper>'
from breads_scraper.rb:9:in `<main>'
My code looks like the following. My theory is that I am using find_or_create incorrectly, or the file doesn't know how to reach the bread method and controller.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'json'
url = Nokogiri::HTML(open("http://en.wikipedia.org/wiki/List_of_breads"))
class Scraper
def initialize
@url = "http://en.wikipedia.org/wiki/List_of_breads"
@nodes = Nokogiri::HTML(open(@url))
end
def summary
bread_data = @nodes
breads = bread_data.css('div.mw-content-ltr table.wikitable tr')
bread_data.search('sup').remove
bread_hashes = breads.map {|x|
if content = x.css('td')[0]
name = content.text
end
if content = x.css('td a.image').map {|link| link['href']}
image = content[0]
end
if content = x.css('td')[2]
type = content.text
end
if content = x.css('td')[3]
country = content.text
end
if content = x.css('td')[4]
description = content.text
end
{
:name => name,
:image => image,
:type => type,
:country => country,
:description => description,
}
Bread.find_or_create(:title => name, :description => description, :image_url => image, :country_origin => country, :type => type)
}
end
bready = Scraper.new
bready.summary
puts "atta boy"
end
Thanks!
Invoke the scraper from a rake task.
lib/tasks/scraper.rake
namespace :app do
desc "Scrape breads"
task :scrape_breads => :environment do
Scraper.new.summary
end
end
Now, you can run the rake task as follows:
rake app:scrape_breads
It looks like the Bread class is not loaded.

uploading csv file to sqlite

I am trying to upload my csv data to my sqlite table, this is my code:
require 'csv'
CSV.open('history.csv', 'r') do |row|
HistoryYear.create(:year => row[1], :first => row[2], :second => row[3], :third => row[4], :regular_season_champ => row[5])
end
I am receiving an error message, NoMethodError: undefined method '[]' for #<CSV:0x3c1fe18>. I am new to Rails and programming in general and cannot seem to find the answer.
You need to use CSV.foreach(file...) instead. CSV.open passes the CSV object itself to the block, not a row, which is why row[1] fails.
You can do something like:
require 'csv'
CSV.foreach('history.csv') do |row|
HistoryYear.create(:year => row[1], :first => row[2], :second => row[3], :third => row[4], :regular_season_champ => row[5])
end
Check out http://www.ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html
Also don't forget that arrays (the rows in this case) are 0 indexed i.e. the first element of the row is row[0].
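To see the indexing in action, here is a small self-contained sketch that writes a sample file and reads it back with CSV.foreach, first as plain arrays and then with headers: true so columns can be read by name. The column names and the two helper methods are assumptions for illustration, not the asker's real data.

```ruby
require 'csv'
require 'tempfile'

# Sample file; the header names are made up for this sketch.
file = Tempfile.new(['history', '.csv'])
file.write("year,first,second\n2001,Lions,Bears\n2002,Hawks,Owls\n")
file.close

# Without headers: each row is a plain Array indexed from 0,
# and the header line is yielded as the first row.
def first_cells(path)
  cells = []
  CSV.foreach(path) { |row| cells << row[0] }
  cells
end

# With headers: each row is a CSV::Row, so columns can be read by name
# and the header line is not yielded as data.
def years(path)
  result = []
  CSV.foreach(path, headers: true) { |row| result << row['year'] }
  result
end

puts first_cells(file.path).inspect
puts years(file.path).inspect
file.unlink
```

Reading by header name avoids off-by-one mistakes when the column order changes.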

How to import large amounts of data into Rails?

To load small amounts of data, I've been using rake tasks to import data from CSVs into Rails:
desc "Import users."
task :import_users => :environment do
File.open("users.txt", "r").each do |line|
name, age, profession = line.strip.split("\t")
u = User.new(:name => name, :age => age, :profession => profession)
u.save
end
end
For larger files (about 50,000 records), though, this is incredibly slow. Is there a faster way to import the data?
You might want to take a look at activerecord-import and check out this similar thread.
Without extra libraries, you can add a little concurrency and take advantage of a multicore machine. (I agree that a bulk import with ar-extensions should be quicker, though it skips model validations.)
# Returns the number of processors for Linux, OS X or Windows.
def number_of_processors
if RUBY_PLATFORM =~ /linux/
return `cat /proc/cpuinfo | grep processor | wc -l`.to_i
elsif RUBY_PLATFORM =~ /darwin/
return `sysctl -n hw.logicalcpu`.to_i
elsif RUBY_PLATFORM =~ /win32/
# this works for windows 2000 or greater
require 'win32ole'
wmi = WIN32OLE.connect("winmgmts://")
wmi.ExecQuery("select * from Win32_ComputerSystem").each do |system|
begin
processors = system.NumberOfLogicalProcessors
rescue
processors = 0
end
return [system.NumberOfProcessors, processors].max
end
end
raise "can't determine 'number_of_processors' for '#{RUBY_PLATFORM}'"
end
desc "Import users."
task :fork_import_users => :environment do
procs = number_of_processors
lines = IO.readlines('users.txt')
nb_lines = lines.size
slices = nb_lines / procs
procs.times do
subset = lines.slice!(0..slices)
fork do
subset.each do |line|
name, age, profession = line.strip.split("\t")
u = User.new(:name => name, :age => age, :profession => profession)
u.save
end
end
end
Process.waitall
end
On my machine with 2 cores, the fork version gives:
real 1m41.974s
user 1m32.629s
sys 0m7.318s
while your version gives:
real 2m56.401s
user 1m21.953s
sys 0m7.529s
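On Ruby 2.2 and later, the standard library's Etc.nprocessors can stand in for the hand-written number_of_processors method, and each_slice gives near-equal chunks. This is only a sketch of the partitioning step: partition_lines is a hypothetical helper name, the sample lines are made up, and each chunk would still be handed to fork as in the task above.

```ruby
require 'etc'

# Splits lines into at most `workers` chunks of near-equal size.
def partition_lines(lines, workers)
  return [] if lines.empty?
  chunk_size = (lines.size.to_f / workers).ceil
  lines.each_slice(chunk_size).to_a
end

workers = Etc.nprocessors # processor count as reported by the OS
lines = Array.new(10) { |i| "user#{i}\t30\tclerk" }
partition_lines(lines, workers).each_with_index do |chunk, i|
  puts "worker #{i}: #{chunk.size} lines"
end
```

Rounding the chunk size up means no line is left over after the last worker has taken its share.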
You should try FasterCSV. I find it quite fast and dead easy to use.

Resources