Ruby/Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)

Ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file, and output the scraped content and its URL to a new file (myResults.csv). However, I keep getting an 'invalid byte sequence in UTF-8 (ArgumentError)' error, which seems to suggest the URLs are not valid? (They are all standard 'http://www.exmaple.com/page' URLs and work in my browser.)
I have tried .parse and .encode from similar threads on here, but no luck. Thanks for reading.
The code:
require 'csv'
require 'nokogiri'
require 'open-uri'
CSV_OPTIONS = {
:write_headers => true,
:headers => %w[url desc]
}
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
csv_doc = File.foreach('listOfURLs.xls') do |url|
URI.parse(URI.encode(url.chomp))
begin
page = Nokogiri.HTML(open(url))
page.css('.bio media-content').each do |scrape|
desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
csv << [url, desc]
end
end
end
end
puts "scraping done!"
The error message:
/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
from bbb.rb:13:in `block (2 levels) in <main>'
from bbb.rb:11:in `foreach'
from bbb.rb:11:in `block in <main>'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from bbb.rb:10:in `<main>'

Two things:
You say that the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls.
The issue seems to be the encoding of listOfURLs.xls. Ruby assumes the file is UTF-8 encoded; if it is not, or if it contains invalid UTF-8 byte sequences, you can get that error.
You should double-check that the file is encoded in UTF-8 and doesn't contain any illegal characters.
If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:
File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |row|
puts row
end
Some good info about invalid byte sequences in UTF-8
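To find out which lines of the list are not valid UTF-8 before scraping, something like the following can help. This is only a minimal sketch: invalid_utf8_lines is a hypothetical helper name, and it assumes the file is a plain-text list with one URL per line.

```ruby
# Hypothetical helper: returns the 1-based numbers of lines that are not valid UTF-8.
def invalid_utf8_lines(path)
  bad = []
  # Read raw bytes so Ruby does not raise while splitting into lines
  File.open(path, 'rb') do |file|
    file.each_line.with_index(1) do |line, number|
      # Relabel the bytes as UTF-8 (no conversion) and test them
      bad << number unless line.force_encoding(Encoding::UTF_8).valid_encoding?
    end
  end
  bad
end
```

Calling invalid_utf8_lines('listOfURLs.xls') returns an empty array when every line is valid UTF-8; any numbers it returns point at the lines worth inspecting.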
Update:
An example:
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  # Read the list as ISO-8859-1 so non-UTF-8 bytes no longer raise
  File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |url|
    url = url.chomp
    URI.parse(URI.encode(url))
    begin
      page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
        desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
        csv << [url, desc]
      end
    end
  end
end

I'm a bit late to the party here, but this should work for anyone running into the same issue in the future:
csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)

Related

IO.pipe(Encoding::BINARY, Encoding::BINARY) failing with UndefinedConversionError, but only under Rails

I have some code for converting from http.rb's chunk-based response body streaming to an ordinary IO.
def stream_response_body(body)
IO.pipe(Encoding::BINARY, Encoding::BINARY) do |rd, wr|
t = copying_thread(body, wr)
yield rd
ensure
t.join if t
end
end
def copying_thread(body, dst)
Thread.new do
body.each { |chunk| dst.write(chunk) }
rescue StandardError => e
UCBLIT::TIND.logger.error(e)
ensure
dst.close
Thread.exit
end
end
This works fine when I call it from a command-line script, but when I call it from a Rails controller, dst.write(chunk) blows up with:
Encoding::UndefinedConversionError ("\xE5" from ASCII-8BIT to UTF-8):
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:106:in `write'
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:106:in `block (2 levels) in copying_thread'
(Script and Rails app are both running under Ruby 2.7.2 on macOS Catalina.)
I've stripped the reading code down to reading byte-by-byte just to make sure the issue wasn't being caused by some downstream library:
response = HTTP.get(url, encoding: Encoding::BINARY)
status = response.status
raise(HTTP::ResponseError, status.to_s) unless status.success?
xml_str_io = StringIO.new
xml_str_io.set_encoding(Encoding::BINARY)
stream_response_body(response.body) do |body|
while (b = body.read(1))
xml_str_io.putc(b)
end
end
Why (and where!) is the ASCII-8BIT to UTF-8 conversion happening at all? And why only when called from Rails?
Update:
I tried the following modifications, neither of which worked:
packed byte array instead of raw string
body.each do |chunk|
byteStr = chunk.bytes.pack('C*')
dst.write(byteStr)
end
use putc instead of write
body.each do |chunk|
chunk.bytes.each do |b|
dst.putc(b)
end
end
Interestingly, in the second case, I still see write in the backtrace:
Encoding::UndefinedConversionError ("\xE5" from ASCII-8BIT to UTF-8):
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:108:in `write'
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:108:in `putc'
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:108:in `block (3 levels) in copying_thread'
I assume this failing write (and probably the others) is in IO's C code somewhere?
Rails sets the default encoding to UTF-8:
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
https://github.com/rails/rails/blob/291a3d2ef29a3842d1156ada7526f4ee60dd2b59/railties/lib/rails.rb#L22-L23
I believe you need to set the encoding on the write end of the pipe; otherwise it will use the default encoding.
read_io, write_io = IO.pipe(Encoding::BINARY, Encoding::BINARY, binmode: true)
write_io.set_encoding(Encoding::BINARY)
write_io.write(chunk) # chunk is a binary (ASCII-8BIT) String
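A minimal sketch of that idea, independent of http.rb: put both ends of the pipe into binary mode so that no transcoding is attempted on write. The helper name copy_through_binary_pipe and the sample chunks are made up for illustration, and the sketch assumes the failure comes from the write end picking up the process-wide default encodings.

```ruby
# Hypothetical helper: copies binary chunks through a pipe and returns what was read.
def copy_through_binary_pipe(chunks)
  rd, wr = IO.pipe
  # binmode tags each end as ASCII-8BIT, so bytes pass through untouched
  rd.binmode
  wr.binmode
  writer = Thread.new do
    begin
      chunks.each { |chunk| wr.write(chunk) }
    ensure
      wr.close # signals EOF to the reader
    end
  end
  data = rd.read
  writer.join
  data
ensure
  rd.close if rd && !rd.closed?
end

# Made-up sample data: a multi-byte sequence split across two chunks
result = copy_through_binary_pipe(["\xE5\x9B".b, "\xBE".b])
```

Splitting a multi-byte sequence across chunks, as in the sample, is the case where treating chunks as text is most likely to go wrong, which is why the bytes are never relabelled here.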

Invalid Byte Sequence in UTF-8 from Excel file

(Ruby 2.5) I have a method that reads and parses a csv file that's being uploaded via Alchemy CMS
def process_csv(csv_file, current_user_id, original_filename)
lock_importer
errors = []
index = 0
string_converter = lambda { |field| field.strip }
total = CSV.foreach(csv_file, headers: true).count
csv_string = csv_file.read.encode!("UTF-8", "iso-8859-1", invalid: :replace)
CSV.parse(csv_string, headers: true, header_converters: :symbol, skip_blanks: true, converters: [string_converter] ) do |row|
# do other stuff
end
but when I try to upload a csv file that has a column (name) with a string that contains special characters then I receive the Invalid Byte Sequence in UTF-8 error. I'm trying to test the value N'öt Réal Stô'rë.
I've tried a few solutions that I found on the web but no luck - any suggestions?
It's unclear what your csv_file is. I guess it is a File object.
Sometimes I get CSV files from Excel in UTF-16, so let's try an example:
I have a csv-file stored in UTF-16BE with the following content:
line;comment;UmlautÄ
1;Das ist UTF-16 BE;Ä
2;öüäÖÄÜ;Ä
If I execute the following code:
require 'csv'
def process_csv(csv_file)
csv_string = csv_file.read#.encode!("UTF-8", "iso-8859-1", invalid: :replace)
CSV.parse(csv_string, headers: true, skip_blanks: true, col_sep: ';') do |row|
p row # do other stuff
end
end
process_csv(File.open('example_utf16BE.txt'))
then I also get an "Invalid byte sequence in UTF-8" error.
If I use
process_csv(File.open('example_utf16BE.txt', 'rb', encoding: 'BOM|utf-16BE'))
then everything works.
So my guess is that you get a File object in the wrong encoding, and the code csv_file.read.encode!("UTF-8", "iso-8859-1", invalid: :replace) is an attempt to repair this problem.
What you can do:
Add to your code:
p csv_file
p csv_file.external_encoding
You should get
#<File:example_utf16BE.txt>
#<Encoding:UTF-16BE>
Now check whether the file (in my example, example_utf16BE.txt) really has the encoding reported on the second line.
If not, try to adapt how the File object is created.
If that is not possible, you can try csv_file.set_encoding 'utf-8' to change the encoding before you read the content.
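If you control how the file is opened, you can also let Ruby transcode to UTF-8 while reading. The following is a rough sketch rather than a drop-in fix: parse_csv_as_utf8 is a hypothetical helper, the sample file is generated on the fly without a BOM, and the source encoding is assumed to be known in advance.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: reads a file stored in source_encoding and returns UTF-8 rows.
def parse_csv_as_utf8(path, source_encoding, col_sep: ';')
  # 'rb' is needed because UTF-16 is not ASCII-compatible;
  # "external:internal" transcodes to UTF-8 while reading
  text = File.open(path, "rb:#{source_encoding}:UTF-8", &:read)
  CSV.parse(text, headers: true, skip_blanks: true, col_sep: col_sep).map(&:to_h)
end

# Build a small UTF-16BE sample file with the content from the example above
sample = Tempfile.new(['example_utf16BE', '.txt'])
sample.binmode
sample.write("line;comment;UmlautÄ\n1;Das ist UTF-16 BE;Ä\n2;öüäÖÄÜ;Ä\n".encode('UTF-16BE'))
sample.close

rows = parse_csv_as_utf8(sample.path, 'UTF-16BE')
```

Each element of rows is a Hash keyed by the header names, with every string already in UTF-8, so the rest of the import code does not need to care about the original encoding.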

In Ruby, how do I deal with non-UTF 8 characters in PDF content?

I’m using Rails 4.2.7. I’m downloading and writing PDF content from the web, like so …
res1 = Net::HTTP.SOCKSProxy('127.0.0.1', 50001).start(uri.host, uri.port) do |http|
puts "launching #{uri}"
resp = http.get(uri)
status = resp.code
content = resp.body
content_type = resp['content-type']
content_encoding = resp['content-encoding']
end
…
if content_type == 'application/pdf' || content_type.include?('application/x-javascript')
File.open(file_location, "w") { |file| file.write content }
I’m noticing that for some content, I get the below error
Error during processing: "\xC2" from ASCII-8BIT to UTF-8
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `write'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `block in pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `open'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:76:in `process_race_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_race_finder_service.rb:75:in `process_race_link'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:29:in `block in process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `each'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'
I tried accounting for it, by replacing invalid characters, like so …
File.open(file_location, "w") { |file| file.write content }
content.encode('UTF-8', :invalid => :replace, :undef => :replace)
but then I get the error
error: PDF malformed, expected 'endstream' but found 0 instead
when trying to read the PDF file. Does anyone know of a better way to deal with downloaded PDF docs that won’t corrupt them?
I think the easiest solution would be to write it as-is using IO.binwrite:
File.binwrite(file_location, content)
The above might fail if the files you receive come in different encodings. In that case I would try:
content.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
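To illustrate why the binary write keeps the PDF intact, here is a small round-trip sketch. The byte string is a made-up stand-in for resp.body, and save_binary is a hypothetical wrapper name; the point is only that File.binwrite stores the bytes without any encoding conversion.

```ruby
require 'tmpdir'

# Hypothetical wrapper: writes the bytes exactly as received.
def save_binary(path, content)
  File.binwrite(path, content) # returns the number of bytes written
end

# Made-up stand-in for resp.body: binary data that is not valid UTF-8
content = "%PDF-1.4\n\xC2\xE5\x00stream".b

file_location = File.join(Dir.mktmpdir, 'sample.pdf')
written = save_binary(file_location, content)
round_trip = File.binread(file_location)
```

Because nothing is replaced or transcoded, round_trip contains the same bytes as content, which is what a PDF reader needs.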

CSV importing in Rails - invalid byte sequence in UTF-8 with non-english characters

I'm using the CSVMapper gem to import some records from a CSV file into a Rails 3 model. (I used this gem because it was the easiest way I found to do this.)
Anyway, the code I'm using to import the records is the following:
r = import('doc/socios_full.csv') do
map_to Associate
after_row lambda{|row, associate| associate.save }
start_at_row 1
[group,member,family_relationship_code,family_relationship_description,last_name,names,...]
#The previous line is actually longer, with more atts, but it's been cut to explain the example
end
And it works very well, except when the parser encounters some non-english characters, like ó, é, ñ, í, °.... That's when I get the following error:
ArgumentError: invalid byte sequence in UTF-8
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1831:in `sub!'
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1831:in `block in shift'
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1825:in `loop'
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1825:in `shift'
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1767:in `each'
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/csv-mapper-0.5.1/lib/csv-mapper.rb:106:in `each_with_index'
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/csv-mapper-0.5.1/lib/csv-mapper.rb:106:in `import'
from (irb):63
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/railties-3.0.9/lib/rails/commands/console.rb:44:in `start'
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/railties-3.0.9/lib/rails/commands/console.rb:8:in `start'
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/railties-3.0.9/lib/rails/commands.rb:23:in `<top (required)>'
from script/rails:6:in `require'
from script/rails:6:in `<main>'
I'm really certain of this because if I replace all of these characters, the problem goes away until the parser finds another non-english character. The thing is that I have a 50k records file, so searching for each character I can think of and trying to import all of these records every time is very time consuming.
Is there a way to ignore these errors and allow the parser to go on? Or is there an easier way to import this CSV file?
Do it like this:
CSV.foreach(filename, :headers => true , :encoding => 'ISO-8859-1') do |row|
I had the same problem trying to read in a CSV file saved via MS Excel. You can specify the encoding as an option. I guess it assumes UTF-8 by default.
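If you also want the fields handed back as UTF-8 rather than ISO-8859-1, the encoding option accepts an "external:internal" pair. A minimal sketch, assuming the file really is ISO-8859-1 (the sample data and file name below are made up):

```ruby
require 'csv'
require 'tempfile'

# Build a small ISO-8859-1 file standing in for an Excel export (made-up data)
sample = Tempfile.new(['socios', '.csv'])
sample.binmode
sample.write("last_name,names\nNúñez,José\n".encode('ISO-8859-1'))
sample.close

# "ISO-8859-1:UTF-8" means: read as ISO-8859-1, hand back UTF-8 strings
rows = CSV.foreach(sample.path, headers: true, encoding: 'ISO-8859-1:UTF-8').map(&:to_h)
```

Some older Ruby versions reportedly had trouble with this option, so reading the file with File.open and an explicit mode string, then passing the result to CSV.parse, is a reasonable fallback.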
I solved it with a different approach. This is a much easier way to import CSV files into a Rails 3 model than using an external gem:
require 'csv'
CSV.foreach('doc/socios_full.csv') do |row|
record = Associate.new(
:media_format => row[0],
:group => row[0],
:member => row[1],
:family_relationship_code => row[2],
:family_relationship_description => row[3],
:last_name => row[4],
:names => row[5],
...
)
record.save!
end
It works flawlessly, even with non-english characters (just tried a 75k import file!). Hope it's helpful for someone.
Maybe, you can try something like this:
csv_string.force_encoding('ISO-8859-1')
The following approach should work in any model assuming you are confident that the CSV will contain the correct header names:
def self.import(file)
CSV.foreach(file.path, headers: true) do |row|
obj = self.new
obj.attributes.each_key do |attribute|
index = row.headers.index(attribute)
obj.send("#{attribute}=",row[index]) if index
end
obj.save
end
end

How to change the encoding during CSV parsing in Rails

I would like to know how I can change the encoding of my CSV file when I import and parse it. I have this code:
csv = CSV.parse(output, :headers => true, :col_sep => ";")
csv.each do |row|
row = row.to_hash.with_indifferent_access
insert_data_method(row)
end
When I read my file, I get this error:
Encoding::CompatibilityError in FileImportingController#load_file
incompatible character encodings: ASCII-8BIT and UTF-8
I read about row.force_encoding('utf-8') but it does not work:
NoMethodError in FileImportingController#load_file
undefined method `force_encoding' for #<ActiveSupport::HashWithIndifferentAccess:0x2905ad0>
Thanks.
I had to read CSV files encoded in ISO-8859-1.
Doing the documented
CSV.foreach(filename, encoding:'iso-8859-1:utf-8', col_sep: ';', headers: true) do |row|
threw the exception
ArgumentError: invalid byte sequence in UTF-8
from csv.rb:2027:in '=~'
from csv.rb:2027:in 'init_separators'
from csv.rb:1570:in 'initialize'
from csv.rb:1335:in 'new'
from csv.rb:1335:in 'open'
from csv.rb:1201:in 'foreach'
so I ended up reading the file and converting it to UTF-8 while reading, then parsing the string:
CSV.parse(File.open(filename, 'r:iso-8859-1:utf-8'){|f| f.read}, col_sep: ';', headers: true, header_converters: :symbol) do |row|
pp row
end
force_encoding is meant to be run on a string, but it looks like you're calling it on a hash. You could say:
output.force_encoding('utf-8')
csv = CSV.parse(output, :headers => true, :col_sep => ";")
...
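It may help to see the difference between force_encoding and encode on a small string. This is just an illustration with made-up bytes: force_encoding only changes the label on the existing bytes, while encode actually converts them.

```ruby
# "café" as ISO-8859-1 bytes: the é is the single byte 0xE9
latin1_bytes = "caf\xE9".b

# Relabelling as UTF-8 leaves the bytes alone, so the string is invalid UTF-8
mislabelled = latin1_bytes.dup.force_encoding('UTF-8')

# Relabelling with the true encoding, then converting, changes the bytes
relabelled = latin1_bytes.dup.force_encoding('ISO-8859-1')
transcoded = relabelled.encode('UTF-8')
```

So force_encoding('utf-8') only helps when the bytes already are UTF-8 and merely carry the wrong label; if they are really ISO-8859-1, you need the force_encoding plus encode combination.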
Hey I wrote a little blog post about what I did, but it's slightly more verbose than what's already been posted. For whatever reason, I couldn't get those solutions to work and this did.
The gist is that I simply replace (or in my case, remove) the invalid/undefined characters in my file and then rewrite it. I used this method to convert the files:
def convert_to_utf8_encoding(original_file)
original_string = original_file.read
final_string = original_string.encode(invalid: :replace, undef: :replace, replace: '') #If you'd rather invalid characters be replaced with something else, do so here.
final_file = Tempfile.new('import') #No need to save a real File
final_file.write(final_string)
final_file.close #Don't forget me
final_file
end
Hope this helps.
Edit: No destination encoding is specified here because encode, when called without one, targets Encoding.default_internal, which for most Rails applications is UTF-8 (I believe).
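On Ruby 2.1 or newer, String#scrub is another way to drop or replace invalid bytes. Note that this swaps in a different technique from the encode call above; strip_invalid_utf8 is a hypothetical helper name and the input is made up.

```ruby
# Hypothetical helper: treats the input as UTF-8 and replaces any invalid bytes.
def strip_invalid_utf8(text, replacement = '')
  # dup so the caller's string is not relabelled in place
  text.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# Made-up input: 0xE9 on its own is not a valid UTF-8 sequence
cleaned = strip_invalid_utf8("caf\xE9 ok".b)
```

Passing a different replacement, such as '?', keeps a visible marker where bytes were dropped, which can make it easier to spot damaged records later.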
