invalid byte sequence in UTF-8 CSV Rails 4 - ruby-on-rails

I'm getting:
ArgumentError invalid byte sequence in UTF-8
With my Resque job
Below is my stack trace:
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/2.0.0/csv.rb:1780:in `sub!'
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/2.0.0/csv.rb:1780:in `block in shift'
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/2.0.0/csv.rb:1774:in `loop'
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/2.0.0/csv.rb:1774:in `shift'
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/2.0.0/csv.rb:1716:in `each'
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/2.0.0/csv.rb:1730:in `to_a'
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/2.0.0/csv.rb:1730:in `read'
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/2.0.0/csv.rb:1291:in `parse'
C:/BitNami/rubystack-2.0.0-4/projects/virtual_exhibition/app/jobs/users.rb:14:in `parse_csv'
C:/BitNami/rubystack-2.0.0-4/projects/virtual_exhibition/app/jobs/users.rb:6:in `perform'
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/gems/2.0.0/gems/resque-status-0.4.2/lib/resque/plugins/status.rb:161:in `safe_perform!'
C:/BitNami/rubystack-2.0.0-4/ruby/lib/ruby/gems/2.0.0/gems/resque-status-0.4.2/lib/resque/plugins/status.rb:137:in `perform'
Also below is my job getting called
class UserJob
include Resque::Plugins::Status
def perform
puts "Parsing CSV and updating..."
parse_csv
puts "Update finished..."
end
def parse_csv
#counter = 0
#row = []
csv_text = File.read("#{Rails.public_path}/careersfair.csv").encode('UTF-8')
csv = CSV.parse(csv_text, headers: false)
csv.each do |row|
user = User.find_by_email row[3]
puts user.inspect
if user.present?
user.update(:first_name => row[0], :last_name => row[1], :industry => row[2], :event_ids => 1, :skip_invitation => true)
puts #counter += 1
else
puts "Not found - #{row[3]}"
end
end
end
end
It seems CSV.parse is failing.
Is there a reason why this is happening?

I think your CSV file has some invalid characters. Change the "csv_text" line to the following:
csv_text = File.read("#{Rails.public_path}/careersfair.csv").encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
If that doesn't solve it, I assume the CSV file is not UTF-8.
If you're on Linux, try file -i filename.txt to see the encoding of the file, then convert from that encoding. Note that Iconv is no longer bundled with Ruby 2.0 and later, so you need the iconv gem for this:
require "iconv"
conv = Iconv.new("UTF-8//IGNORE","ENCODING_OF_YOUR_FILE")
csv_text = File.read("#{Rails.public_path}/careersfair.csv")
csv_text = conv.iconv(csv_text)
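The same conversion can be sketched without Iconv, using only String#force_encoding and String#encode. This is a minimal sketch, not the asker's code: the helper name to_utf8, the sample row, and the guess that the file is ISO-8859-1 are all assumptions made for illustration.

```ruby
require "csv"

# Hypothetical helper: label the raw bytes with their real encoding
# (ISO-8859-1 is only an assumed default), then transcode to UTF-8,
# replacing anything that cannot be converted with "?".
def to_utf8(raw, source_encoding = "ISO-8859-1")
  raw.dup.force_encoding(source_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Sample bytes standing in for File.binread(".../careersfair.csv").
# \xE9 and \xED are Latin-1 bytes that are not valid UTF-8 on their own.
raw = "Jos\xE9,Garc\xEDa,IT,jose@example.com\n".b

csv_text = to_utf8(raw)
rows = CSV.parse(csv_text, headers: false)
```

Reading the file with File.binread first keeps Ruby from guessing an encoding before you have had a chance to state the real one.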

Related

Ruby/Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)

ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file, and output the scraped content and its URL to a new file (myResults.csv). However, I keep getting a 'invalid byte sequence in UTF-8 (ArgumentError)' error, which is suggesting the URLs are not valid? (they are all standard 'http://www.exmaple.com/page' and work in my browser)?
Have tried .parse and .encode from similar threads on here, but no luck. Thanks for reading.
The code:
require 'csv'
require 'nokogiri'
require 'open-uri'
CSV_OPTIONS = {
:write_headers => true,
:headers => %w[url desc]
}
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
csv_doc = File.foreach('listOfURLs.xls') do |url|
URI.parse(URI.encode(url.chomp))
begin
page = Nokogiri.HTML(open(url))
page.css('.bio media-content').each do |scrape|
desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
csv << [url, desc]
end
end
end
end
puts "scraping done!"
The error message:
/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
from bbb.rb:13:in `block (2 levels) in <main>'
from bbb.rb:11:in `foreach'
from bbb.rb:11:in `block in <main>'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from bbb.rb:10:in `<main>'
Two things:
You say that the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls.
The issue seems to be the encoding of listOfURLs.xls: Ruby assumes that the file is UTF-8 encoded. If the file is not UTF-8 encoded, or contains invalid UTF-8 byte sequences, you can get that error.
You should double-check that the file is encoded in UTF-8 and doesn't contain any illegal characters.
If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:
File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |row|
  puts row
end
Some good info about invalid byte sequences in UTF-8
Update:
An example:
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |url|
    URI.parse(URI.encode(url.chomp))
    begin
      page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
        desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
        csv << [url, desc]
      end
    end
  end
end
I'm a bit late to the party here, but this should work for anyone running into the same issue in the future:
csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
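As a rough sketch of the "tell Ruby the real encoding when reading" idea, the example below transcodes each line to UTF-8 as it is read, by passing an external:internal encoding pair to File.foreach. The helper name read_lines_as_utf8, the temporary file, and the ISO-8859-1 guess are assumptions for illustration only.

```ruby
require "tempfile"

# Hypothetical helper: read a file whose bytes are in source_encoding
# and hand back UTF-8 strings, one per line, without the line ending.
def read_lines_as_utf8(path, source_encoding = "ISO-8859-1")
  File.foreach(path, encoding: "#{source_encoding}:UTF-8").map(&:chomp)
end

# Build a small Latin-1 sample file so the sketch is self-contained.
sample = Tempfile.new("urls")
sample.binmode
sample.write("http://www.example.com/caf\xE9\n".b)
sample.close

urls = read_lines_as_utf8(sample.path)
sample.unlink
```

Because the strings are already valid UTF-8 when they reach the rest of the script, later calls such as gsub no longer see invalid byte sequences.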

Getting wrong number of arguments (1 for 2), but I believe I'm passing two

I'm getting a wrong number of arguments (1 for 2) error even though I'm passing two arguments, or at least attempting to.
Here is the app trace.
app/controllers/ips_dashboard_controller.rb:6:in `initialize'
app/controllers/ips_dashboard_controller.rb:82:in `new'
app/controllers/ips_dashboard_controller.rb:82:in `block (2 levels) in ips_dashboard'
app/controllers/ips_dashboard_controller.rb:81:in `each'
app/controllers/ips_dashboard_controller.rb:81:in `block in ips_dashboard'
app/controllers/ips_dashboard_controller.rb:74:in `each'
app/controllers/ips_dashboard_controller.rb:74:in `ips_dashboard'
Here I'm looking up ip addresses in the db and passing the array to the IP_data class to use in Seer::visualize.
# Begin lookups for tgt addresses
target_ip_data = Array.new
@tgt_ip_array = Array.new
@events.each do |ip_event|
  def get_target_ip(sid,cid)
    IpsIpHdr.where('sid =? and cid =?', sid, cid).first.ip_dst
  end
  tgt_ip = get_target_ip(ip_event.sid, ip_event.cid).to_s(16).rjust(8,'0').scan(/.{2}/).map(&:hex).join('.')
  target_ip_data.push(tgt_ip)
  @tgt_ip_hash = Hash[target_ip_data.group_by {|x| x}.map {|k,v| [k,v.count]}]
  @tgt_ip_hash.each do |t|
    @tgt_ip_array.push(IP_data.new(:ip => t[0],:count => t[1]))
  end
end
# End lookups for tgt addresses
I also tried this, but also got an error. undefined method 'ip' for ["172.31.251.13", 24]:Array
# Begin lookups for tgt addresses
target_ip_data = Array.new
@tgt_ip_array = Array.new
@events.each do |ip_event|
  def get_target_ip(sid,cid)
    IpsIpHdr.where('sid =? and cid =?', sid, cid).first.ip_dst
  end
  tgt_ip = get_target_ip(ip_event.sid, ip_event.cid).to_s(16).rjust(8,'0').scan(/.{2}/).map(&:hex).join('.')
  target_ip_data.push(tgt_ip)
  @tgt_ip_hash = Hash[target_ip_data.group_by {|x| x}.map {|k,v| [k,v.count]}]
  @tgt_ip_hash.each do |t|
    IP_data.new(t[0],t[1])
  end
end
# End lookups for tgt addresses
This is the error
undefined method `ip' for ["172.31.251.13", 24]:Array
Extracted source (around line #186):
183:
184: <%=
185: if @tgt_ip_hash.count > 0
186: raw Seer::visualize(
187: @tgt_ip_hash,
188: :as => :pie_chart,
189: :in_element => 'tgt_pie_chart',
Here is the class
class IP_data
  attr_accessor :ip, :count
  def initialize(ip, count)
    @ip = ip
    @count = count
  end
end
You are actually sending a hash:
IP_data.new({:ip => t[0],:count => t[1]})
Just do:
IP_data.new(t[0],t[1])
You still need the previous loop (you deleted the @tgt_ip_array.push in the loop); change it to this:
@tgt_ip_hash.each do |t|
  @tgt_ip_array.push(IP_data.new(t[0],t[1]))
end
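To make the difference concrete, here is a small self-contained sketch of both calls. The IP_data class is copied from the question; the helper int_to_dotted_quad and the sample integers are made up for illustration and stand in for the ip_dst values read from the database.

```ruby
class IP_data
  attr_accessor :ip, :count

  def initialize(ip, count)
    @ip = ip
    @count = count
  end
end

# Hypothetical helper wrapping the conversion used in the question:
# integer -> 8 hex digits -> four byte values joined with dots.
def int_to_dotted_quad(value)
  value.to_s(16).rjust(8, "0").scan(/.{2}/).map(&:hex).join(".")
end

# Made-up sample data in place of the IpsIpHdr lookups.
raw_addresses = [2887777037, 167772161, 2887777037]
target_ip_data = raw_addresses.map { |n| int_to_dotted_quad(n) }

# Count occurrences per address, then wrap each pair in an IP_data object.
tgt_ip_hash = Hash[target_ip_data.group_by { |x| x }.map { |k, v| [k, v.count] }]
tgt_ip_array = tgt_ip_hash.map { |ip, count| IP_data.new(ip, count) }

# Passing :ip => ..., :count => ... builds ONE hash argument, so a
# two-parameter initialize raises ArgumentError.
hash_call_failed = begin
  IP_data.new(:ip => "172.31.251.13", :count => 24)
  false
rescue ArgumentError
  true
end
```

Passing tgt_ip_array (objects that respond to ip and count) rather than the hash of pairs is what avoids the "undefined method `ip' for Array" error in the view, assuming Seer calls those methods on each item.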

Encoding::UndefinedConversionError: "\xA8" from ASCII-8BIT to UTF-8 (SFTP)

Using the Net-SFTP gem, Ruby 2 and Rails 4
I wrote code that was working in pure ruby, but copied my code over to rails and now I get:
Encoding::UndefinedConversionError: "\xA8" from ASCII-8BIT to UTF-8
What can I change in my code to get this working?
def self.get_recent_file(ftp_file, local_file)
Net::SFTP.start(Config::A_FTP[:domain], Config::A_FTP[:username], :password => Config::A_FTP[:password]) do |sftp|
sftp.download!(ftp_file, local_file)
end
end
Log
Encoding::UndefinedConversionError: "\xA8" from ASCII-8BIT to UTF-8
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp/operations/download.rb:339:in `write'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp/operations/download.rb:339:in `write'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp/operations/download.rb:339:in `on_read'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp/request.rb:87:in `call'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp/request.rb:87:in `respond_to'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp/session.rb:948:in `dispatch_request'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp/session.rb:911:in `when_channel_polled'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/channel.rb:311:in `call'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/channel.rb:311:in `process'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:222:in `block in preprocess'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:222:in `each'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:222:in `preprocess'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:205:in `process'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:169:in `block in loop'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:169:in `loop'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:169:in `loop'
... 13 levels...
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:222:in `preprocess'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:205:in `process'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:169:in `block in loop'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:169:in `loop'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-ssh-2.8.0/lib/net/ssh/connection/session.rb:169:in `loop'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp/session.rb:802:in `loop'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp/session.rb:787:in `connect!'
from /usr/local/opt/rbenv/versions/2.1.0-rc1/lib/ruby/gems/2.1.0/gems/net-sftp-2.1.2/lib/net/sftp.rb:32:in `start'
Code referenced in log from GEM:
https://github.com/net-ssh/net-sftp/blob/master/lib/net/sftp/operations/download.rb#L339
# Called when a read from a file finishes. If the read was successful
# and returned data, this will call #download_next_chunk to read the
# next bit from the file. Otherwise the file will be closed.
def on_read(response)
entry = response.request[:entry]
if response.eof?
update_progress(:close, entry)
entry.sink.close
request = sftp.close(entry.handle, &method(:on_close))
request[:entry] = entry
elsif !response.ok?
raise "read #{entry.remote}: #{response}"
else
entry.offset += response[:data].bytesize
update_progress(:get, entry, response.request[:offset], response[:data])
entry.sink.write(response[:data]) # <~~ Line#339
download_next_chunk(entry)
end
end
This helps me:
def self.get_recent_file(ftp_file, local_file)
local_io = File.new(local_file, mode: 'w', encoding: 'ASCII-8BIT')
Net::SFTP.start(Config::A_FTP[:domain], Config::A_FTP[:username], :password => Config::A_FTP[:password]) do |sftp|
sftp.download!(ftp_file, local_io)
end
local_io.close
end
A combination of user72136's answer and the answer to this question worked for me (my remote file wasn't even ASCII):
def self.get_recent_file(ftp_file, local_file)
local_io = File.new(local_file, mode: 'wb')
Net::SFTP.start(Config::A_FTP[:domain], Config::A_FTP[:username], :password => Config::A_FTP[:password]) do |sftp|
sftp.download!(ftp_file, local_io)
end
local_io.close
end
Line 339 of download.rb is:
entry.sink.write(response[:data])
Change it to:
entry.sink.write(response[:data].force_encoding('ASCII-8BIT').encode('UTF-8'))
Change the line -
sftp.download!(ftp_file, local_file)
to say
sftp.download!(ftp_file, local_file).to_s.encode('UTF-8', {:invalid => :replace, :undef => :replace, :replace => '?'})
This problem comes from how Ruby opens files as text by default: since Ruby 2.0 it assumes UTF-8, so binary data written to the file gets transcoded. Where you open your file, you can specify a binary encoding:
local_file = Tempfile.new(encoding: 'ascii-8bit')
# or switch the file to binary mode
local_file = Tempfile.new
local_file.binmode
You can also open a file in binary mode like this:
local_file = File.open('/tmp/local_file', 'wb')
Another solution is to pass the gem the file path instead of an open file:
def self.get_recent_file(ftp_file, local_file)
Net::SFTP.start(Config::A_FTP[:domain], Config::A_FTP[:username], :password => Config::A_FTP[:password]) do |sftp|
sftp.download!(ftp_file, local_file.path)
end
end
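The effect described in these answers can be reproduced without SFTP at all. The sketch below is an illustration under assumptions: the payload bytes and temp file are invented, and opening with "w:UTF-8" is used here to stand in for whatever gave the Rails-side file a UTF-8 encoding.

```ruby
require "tempfile"

# Made-up binary payload, standing in for one downloaded chunk.
payload = "\xA8\x01binary".b

tmp = Tempfile.new("sftp_demo")
path = tmp.path
tmp.close

# Writing a binary string to a handle that expects UTF-8 makes Ruby
# try to transcode ASCII-8BIT -> UTF-8, which fails on byte \xA8.
text_mode_failed = false
begin
  File.open(path, "w:UTF-8") { |io| io.write(payload) }
rescue Encoding::UndefinedConversionError
  text_mode_failed = true
end

# In binary mode the bytes are written untouched.
File.open(path, "wb") { |io| io.write(payload) }
round_trip = File.binread(path)
tmp.unlink
```

This is why handing the gem a handle opened with 'wb' (or just a path) avoids the conversion error: no transcoding is attempted on the downloaded bytes.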

Ruby csv import trouble

I tried to import data from csv in my rails app, but something went wrong:
CSV::MalformedCSVError in ArticlesController#index
Unclosed quoted field on line 1.
My csv looks like this:
"Код";"№ по каталогу (артикул)";"Наименование товара";"Ед. изм.";"Цена опт.";"Доп.";"Остатки";"Класс";"Группа";"Бренд";"Блок."
2223;15-562-44;15-562-44 (27-B07-F) VW Polo 95-R ;шт ;37,430;;;Амортизаторы ;Амортизаторы BOGE ;;
10327;24-052-1;24-052-1(46-A27-0) LAND ROVER 84- F ;шт ;68,750;;;Амортизаторы ;Амортизаторы BOGE ;;
10328;24-053-1;24-053-1(46-A28-0) LAND ROVER 84- R ;шт ;68,750;;;Амортизаторы ;Амортизаторы BOGE ;;
Maybe this is because of the first line (which has no ;;)
My code looks like this:
def csv_import
require 'csv'
file = File.open("/#{Rails.public_path}/uploads/smallcsv.csv")
#csv = CSV.parse(file)
csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => ";;", :headers => :first_row})
file_path = "/#{Rails.public_path}/uploads/smallcsv.csv"
##parsed_file=CSV::Reader.parse(file_path)
csv.each do |row|
ename = row[2]
eprice = row[5]
eqnt = row[7]
esupp = row[10]
logger.warn(ename)
end
end
I'm running ruby 1.9+ with fastercsv gem
I figured this out myself with the help of "CSV - Unquoted fields do not allow \r or \n (line 2)".
The problem was with the first line (it does not end in ;;), so setting :row_sep to :auto helped me.
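A minimal sketch of that fix, assuming the standard CSV library (which replaced FasterCSV in Ruby 1.9) and a shortened, three-column version of the sample data from the question. The trailing empty fields are dropped here for brevity, and the variable names are illustrative.

```ruby
require "csv"

# Shortened sample: quoted header row, unquoted data rows, ";" separator.
data = <<~ROWS
  "Код";"Наименование товара";"Цена опт."
  2223;15-562-44 (27-B07-F) VW Polo 95-R ;37,430
  10327;24-052-1(46-A27-0) LAND ROVER 84- F ;68,750
ROWS

# :auto lets CSV detect the line ending instead of expecting ";;".
table = CSV.parse(data, col_sep: ";", row_sep: :auto, headers: true)

names  = table.map { |row| row["Наименование товара"].strip }
# Prices use a decimal comma, so swap it for a dot before converting.
prices = table.map { |row| row["Цена опт."].tr(",", ".").to_f }
```

With headers enabled, fields can be read by column name rather than by position, which is less fragile than row[2] or row[5] if columns are reordered.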

How would I find similar lines in two CSV files?

Here is my code but it takes forever for huge files:
require 'rubygems'
require "faster_csv"
fname1 =ARGV[0]
fname2 =ARGV[1]
if ARGV.size!=2
puts "Display common lines in the two files \n Usage : ruby user_in_both_files.rb <file1> <file2> "
exit 0
end
puts "loading the CSV files ..."
file1=FasterCSV.read(fname1, :headers => :first_row)
file2=FasterCSV.read(fname2, :headers => :first_row)
puts "CSV files loaded"
#puts file2[219808].to_s.strip.gsub(/\s+/,'')
lineN1=0
lineN2=0
# count how many common lines
similarLines=0
file1.each do |line1|
lineN1=lineN1+1
#compare line 1 to all line from file 2
lineN2=0
file2.each do |line2|
puts "file1:l#{lineN1}|file2:l#{lineN2}"
lineN2=lineN2+1
if ( line1.to_s.strip.gsub(/\s+/,'') == line2.to_s.strip.gsub(/\s+/,'') )
puts "file1:l#{line1}|file2:l#{line2}->#{line1}\n"
similarLines=similarLines+1
end
end
end
puts "#{similarLines} similar lines."
Ruby has set operations available with arrays:
a_ary = [1,2,3]
b_ary = [3,4,5]
a_ary & b_ary # => [3]
So, from that you should try:
puts "loading the CSV files ..."
file1 = FasterCSV.read(fname1, :headers => :first_row)
file2 = FasterCSV.read(fname2, :headers => :first_row)
puts "CSV files loaded"
common_lines = file1 & file2
puts common_lines.size
If you need to preprocess the arrays, do it as you load them:
file1 = FasterCSV.read(fname1, :headers => :first_row).map{ |l| l.to_s.strip.gsub(/\s+/, '') }
file2 = FasterCSV.read(fname2, :headers => :first_row).map{ |l| l.to_s.strip.gsub(/\s+/, '') }
You're gsubing File2 every time you loop through File1. I'd do that first, and then just compare the results of that.
Edit: Something like this (untested):
file1lines = []
file1.each do |line1|
  file1lines << line1.to_s.strip.gsub(/\s+/, '')
end
# Do the same for `file2lines`
file1lines.each do |line1|
lineN1=lineN1+1
#compare line 1 to all line from file 2
lineN2=0
file2lines.each do |line2|
puts "file1:l#{lineN1}|file2:l#{lineN2}"
lineN2=lineN2+1
if ( line1 == line2 )
puts "file1:l#{line1}|file2:l#{line2}->#{line1}\n"
similarLines=similarLines+1
end
end
end
I'd also get rid of all the putses in the loops unless you really need them. If the files are huge, that's probably slowing it down a noticeable amount.
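Putting both suggestions together, one possible sketch normalizes every line once and then uses a Set for lookups, so each line of the second file is checked in roughly constant time instead of being compared against every line of the first. It uses the standard CSV library rather than FasterCSV, and the helper name normalized_rows and the inline sample data are assumptions for illustration.

```ruby
require "csv"
require "set"

# Hypothetical helper: parse CSV text (first row is the header) and
# return each data row as a string with all whitespace removed.
def normalized_rows(csv_text)
  CSV.parse(csv_text, headers: true).map { |row| row.to_s.strip.gsub(/\s+/, "") }
end

# Inline samples standing in for File.read(fname1) and File.read(fname2).
file1 = "name,email\nAnn, ann@example.com\nBob,bob@example.com\n"
file2 = "name,email\nBob, bob@example.com\nCy,cy@example.com\n"

seen = normalized_rows(file1).to_set
common = normalized_rows(file2).select { |line| seen.include?(line) }

puts "#{common.size} similar lines."
```

For very large files the two reads could be streamed with CSV.foreach instead of loading everything at once, at the cost of slightly more code.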

Resources