Ruby/Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)

Ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file, and output the scraped content and its URL to a new file (myResults.csv). However, I keep getting an 'invalid byte sequence in UTF-8 (ArgumentError)' error, which seems to suggest the URLs are not valid? (They are all standard '' and work in my browser.)
Have tried .parse and .encode from similar threads on here, but no luck. Thanks for reading.
The code:
require 'csv'
require 'nokogiri'
require 'open-uri'
CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url desc]
}
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls') do |url|
    page = Nokogiri.HTML(open(url))
    page.css('.bio media-content').each do |scrape|
      desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
      csv << [url, desc]
    end
  end
end
puts "scraping done!"
The error message:
/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
from bbb.rb:13:in `block (2 levels) in <main>'
from bbb.rb:11:in `foreach'
from bbb.rb:11:in `block in <main>'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from bbb.rb:10:in `<main>'

Two things:
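You say that the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls. If that really is an Excel workbook rather than a renamed text file, it is a binary format and cannot be read line by line as text.
The issue seems to be the encoding of listOfURLs.xls. Ruby assumes that the file is UTF-8 encoded. If the file is not UTF-8 encoded, or contains bytes that are not valid UTF-8, you can get that error. Your backtrace fits this: the exception is raised while escaping the URL string, which is the line just read from the file.
You should double check that the file is encoded in UTF-8 and doesn't contain any illegal characters.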
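One way to do that check is to read the file as raw bytes and test each line with String#valid_encoding?. The following is only a minimal sketch: find_invalid_lines is a hypothetical helper name, and the demo file and its URLs are made up for illustration.

```ruby
require 'tempfile'

# Returns [line_number, raw_line] pairs for every line that is not valid
# in the given encoding. (find_invalid_lines is a hypothetical helper,
# not something from the question.)
def find_invalid_lines(path, encoding: 'UTF-8')
  bad = []
  # Binary mode: lines come back as raw bytes, so reading never raises.
  File.foreach(path, mode: 'rb').with_index(1) do |line, number|
    bad << [number, line] unless line.dup.force_encoding(encoding).valid_encoding?
  end
  bad
end

# Demo with a throwaway file: line 2 contains "é" as the ISO-8859-1 byte 0xE9.
Tempfile.create('urls') do |f|
  f.binmode
  f.write("http://example.com/a\n".b + "http://example.com/caf\xE9\n".b)
  f.flush
  find_invalid_lines(f.path).each do |number, line|
    puts "line #{number}: #{line.inspect}"
  end
end
```

If this reports many lines, the file is probably in another encoding altogether; if it reports one or two, a stray character is the more likely cause.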
If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:
File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |row|
  puts row
end
Some good info about invalid byte sequences in UTF-8
An example:
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls', encoding: "iso-8859-1") do |url|
    page = Nokogiri.HTML(open(url))
    page.css('.bio media-content').each do |scrape|
      desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
      csv << [url, desc]
    end
  end
end

I'm a bit late to the party here, but this should work for anyone running into the same issue in the future:
csv_doc = File.read('listOfURLs.xls').force_encoding('ISO-8859-1').encode('utf-8', replace: nil)


IO.pipe(Encoding::BINARY, Encoding::BINARY) failing with UndefinedConversionError, but only under Rails

I have some code for converting from http.rb's chunk-based response body streaming to an ordinary IO.
def stream_response_body(body)
  IO.pipe(Encoding::BINARY, Encoding::BINARY) do |rd, wr|
    t = copying_thread(body, wr)
    yield rd
    t.join if t
  end
end
def copying_thread(body, dst)
  Thread.new do
    body.each { |chunk| dst.write(chunk) }
  rescue StandardError => e
    # (error handling not shown)
  end
end
This works fine when I call it from a command-line script, but when I call it from a Rails controller, dst.write(chunk) blows up with:
Encoding::UndefinedConversionError ("\xE5" from ASCII-8BIT to UTF-8):
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:106:in `write'
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:106:in `block (2 levels) in copying_thread'
(Script and Rails app are both running under Ruby 2.7.2 on macOS Catalina.)
I've stripped the reading code down to reading byte-by-byte just to make sure the issue wasn't being caused by some downstream library:
response = HTTP.get(url, encoding: Encoding::BINARY)
status = response.status
raise(HTTP::ResponseError, status.to_s) unless status.success?
xml_str_io =
stream_response_body(response.body) do |body|
while (b =
Why (and where!) is the ASCII-8BIT to UTF-8 conversion happening at all? And why only when called from Rails?
I tried the following modifications, neither of which worked:
Packed byte array instead of raw string:
body.each do |chunk|
  byteStr = chunk.bytes.pack('C*')
  dst.write(byteStr)
end
putc instead of write:
body.each do |chunk|
  chunk.bytes.each do |b|
    dst.putc(b)
  end
end
Interestingly, in the second case, I still see write in the backtrace:
Encoding::UndefinedConversionError ("\xE5" from ASCII-8BIT to UTF-8):
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:108:in `write'
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:108:in `putc'
/Users/david/.rvm/gems/ruby-2.7.2/bundler/gems/ucblit-tind-de599ab253cc/lib/ucblit/tind/api/api.rb:108:in `block (3 levels) in copying_thread'
I assume this failing write (and probably the others) is in IO's C code somewhere?
Rails sets the default encodings to UTF-8:
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
I believe you need to set the encoding on the writer end of the pipe as well; otherwise it will use the default encoding.
read_io, write_io = IO.pipe(Encoding::BINARY, Encoding::BINARY, binmode: true)
write_io.write([serialized_object].pack('NA*'), encoding: 'BINARY')
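Another option that may be worth trying is IO#binmode, which is documented to disable encoding conversion and treat content as ASCII-8BIT. The sketch below only shows that raw bytes pass through a pipe unchanged once both ends are in binary mode. It does not reproduce the Rails environment, so treat it as an assumption that this also resolves the failure described in the question.

```ruby
# Bytes that are not valid UTF-8; 0xE5 is the byte from the error message.
payload = "\xE5\x00\xFF".b

rd, wr = IO.pipe
# Binary mode on both ends: no transcoding on write or on read.
rd.binmode
wr.binmode

wr.write(payload)
wr.close # signals EOF to the reader

received = rd.read
rd.close

puts received.encoding
puts received.bytes == payload.bytes
```

In the question's code this would mean calling wr.binmode inside the IO.pipe block before starting the copying thread.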

Invalid Byte Sequence in UTF-8 from Excel file

(Ruby 2.5) I have a method that reads and parses a csv file that's being uploaded via Alchemy CMS
def process_csv(csv_file, current_user_id, original_filename)
  errors = []
  index = 0
  string_converter = lambda { |field| field.strip }
  total = CSV.foreach(csv_file, headers: true).count
  csv_string = csv_file.read.encode!("UTF-8", "iso-8859-1", invalid: :replace)
  CSV.parse(csv_string, headers: true, header_converters: :symbol, skip_blanks: true, converters: [string_converter] ) do |row|
    # do other stuff
  end
end
But when I try to upload a CSV file that has a column (name) containing a string with special characters, I receive the Invalid Byte Sequence in UTF-8 error. I'm testing with the value N'öt Réal Stô'rë.
I've tried a few solutions that I found on the web but no luck - any suggestions?
It's unclear what your csv_file is. I guess it is a File object.
Sometimes I get CSV files from Excel as UTF-16, so let's try an example.
I have a CSV file stored in UTF-16BE with the following content:
1;Das ist UTF-16 BE;Ä
If I execute the following code:
require 'csv'
def process_csv(csv_file)
  csv_string = csv_file.read.encode!("UTF-8", "iso-8859-1", invalid: :replace)
  CSV.parse(csv_string, headers: true, skip_blanks: true, col_sep: ';') do |row|
    p row # do other stuff
  end
end
then I also get an Invalid byte sequence in UTF-8 error.
If I use
process_csv(File.open('example_utf16BE.txt', 'rb', encoding: 'BOM|utf-16BE'))
then everything works.
So I guess you get a File object in the wrong encoding, and the code csv_file.read.encode!("UTF-8", "iso-8859-1", invalid: :replace) is an attempt to repair this problem.
What you can do:
Add to your code:
p csv_file
p csv_file.external_encoding
This prints the File object and then its external encoding.
Now check whether the file (in my example: example_utf16BE.txt) really has the encoding shown on the second line.
If not, try to adapt how the File object is created.
If this is not possible, then you can try to use csv_file.set_encoding 'utf-8' to change the encoding before you read the content.
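Putting the UTF-16 case together, here is a minimal sketch that opens a UTF-16BE file with BOM detection, transcodes it to UTF-8 and parses it. parse_utf16_csv is a hypothetical helper name and the sample content is made up; it assumes the file really is UTF-16BE, which is what the checks above are meant to establish.

```ruby
require 'csv'
require 'tempfile'

# Parses a semicolon-separated file saved as UTF-16BE (with or without BOM).
# parse_utf16_csv is a hypothetical helper name.
def parse_utf16_csv(path)
  # 'rb' is required because UTF-16 is not ASCII-compatible;
  # the 'BOM|' prefix strips a leading byte order mark if one is present.
  text = File.open(path, 'rb', encoding: 'BOM|utf-16BE') do |f|
    f.read.encode('UTF-8')
  end
  CSV.parse(text, headers: true, col_sep: ';').map(&:to_h)
end

# Demo with a throwaway UTF-16BE file that starts with a BOM.
Tempfile.create('utf16') do |f|
  f.binmode
  f.write("\uFEFFid;text\n1;\u00C4\n".encode('UTF-16BE'))
  f.flush
  p parse_utf16_csv(f.path)
end
```

Transcoding from the real source encoding keeps characters such as Ä intact, whereas declaring the wrong source encoding (iso-8859-1 here) either fails or produces mojibake.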

In Ruby, how do I deal with non-UTF 8 characters in PDF content?

I’m using Rails 4.2.7. I’m downloading and writing PDF content from the web, like so …
res1 = Net::HTTP.SOCKSProxy('', 50001).start(uri.host, uri.port) do |http|
  puts "launching #{uri}"
  resp = http.get(uri)
  status = resp.code
  content = resp.body
  content_type = resp['content-type']
  content_encoding = resp['content-encoding']
  if content_type == 'application/pdf' || content_type.include?('application/x-javascript')
    File.open(file_location, "w") { |file| file.write content }
  end
end
I’m noticing that for some content, I get the below error
Error during processing: "\xC2" from ASCII-8BIT to UTF-8
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `write'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `block in pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `open'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:76:in `process_race_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_race_finder_service.rb:75:in `process_race_link'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:29:in `block in process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `each'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord- `each'
I tried accounting for it by replacing invalid characters, like so …
File.open(file_location, "w") { |file| file.write content }
content.encode('UTF-8', :invalid => :replace, :undef => :replace)
but then I get the error
error: PDF malformed, expected 'endstream' but found 0 instead
when trying to read the PDF file. Does anyone know of a better way to deal with downloaded PDF docs that won’t corrupt them?
I think the easiest solution would be to write it as is, using IO.binwrite:
File.binwrite(file_location, content)
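As a quick sanity check, the sketch below writes bytes that are not valid UTF-8 with File.binwrite and reads them back unchanged. save_download is a hypothetical helper name, and the byte string is only a stand-in for a real PDF body.

```ruby
require 'tmpdir'

# Writes the response body exactly as received, with no encoding conversion.
# save_download is a hypothetical helper name.
def save_download(file_location, content)
  File.binwrite(file_location, content)
end

# Stand-in for a downloaded PDF body: arbitrary bytes, including 0xC2.
content = "%PDF-1.4\n\xC2\xA0\xFF\x00 stream bytes\n%%EOF\n".b

Dir.mktmpdir do |dir|
  file_location = File.join(dir, 'download.pdf') # hypothetical path
  save_download(file_location, content)
  puts File.binread(file_location) == content
end
```

Because a PDF is binary data rather than text, replacing "invalid" characters changes the byte counts inside its streams, which is consistent with the "expected 'endstream'" error in the question.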
The above might fail if the files you receive are in different encodings. In that case I would try to

CSV importing in Rails - invalid byte sequence in UTF-8 with non-english characters

I'm using the CSVMapper gem to import some records from a CSV file into a Rails 3 model. (I used this gem because it's the easiest way I've found to do this.)
Anyway, the code I'm using to import the records is the following:
r = import('doc/socios_full.csv') do
  map_to Associate
  after_row lambda{|row, associate| }
  start_at_row 1
  #The previous line is actually longer, with more atts, but it's been cut to explain the example
end
And it works very well, except when the parser encounters some non-english characters, like ó, é, ñ, í, °.... That's when I get the following error:
ArgumentError: invalid byte sequence in UTF-8
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1831:in `sub!'
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1831:in `block in shift'
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1825:in `loop'
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1825:in `shift'
from /home/bcb/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/csv.rb:1767:in `each'
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/csv-mapper-0.5.1/lib/csv-mapper.rb:106:in `each_with_index'
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/csv-mapper-0.5.1/lib/csv-mapper.rb:106:in `import'
from (irb):63
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/railties-3.0.9/lib/rails/commands/console.rb:44:in `start'
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/railties-3.0.9/lib/rails/commands/console.rb:8:in `start'
from /home/bcb/.rvm/gems/ruby-1.9.2-p136/gems/railties-3.0.9/lib/rails/commands.rb:23:in `<top (required)>'
from script/rails:6:in `require'
from script/rails:6:in `<main>'
I'm really certain of this because if I replace all of these characters, the problem goes away until the parser finds another non-english character. The thing is that I have a 50k records file, so searching for each character I can think of and trying to import all of these records every time is very time consuming.
Is there a way to ignore these errors and allow the parser to go on? Or is there an easier way to import this CSV file?
Do it like this:
CSV.foreach(filename, :headers => true , :encoding => 'ISO-8859-1') do |row|
  # process row
end
I had the same problem trying to read in a CSV file saved via MS Excel. You can specify the encoding as an option. I guess it assumes UTF-8 by default.
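If the rest of the application expects UTF-8 strings, the encoding option can also name a target encoding, in the form 'external:internal'. A minimal sketch, assuming the file is ISO-8859-1; read_latin1_csv is a hypothetical helper name and the sample data is made up:

```ruby
require 'csv'
require 'tempfile'

# Reads an ISO-8859-1 CSV and returns its rows as hashes of UTF-8 strings.
# read_latin1_csv is a hypothetical helper name.
def read_latin1_csv(path)
  rows = []
  # 'ISO-8859-1:UTF-8' means: the file is ISO-8859-1, hand me UTF-8.
  CSV.foreach(path, headers: true, encoding: 'ISO-8859-1:UTF-8') do |row|
    rows << row.to_h
  end
  rows
end

# Demo with a throwaway file containing the Latin-1 bytes for é and ó.
Tempfile.create('latin1') do |f|
  f.binmode
  f.write("name,city\nJos\xE9,Le\xF3n\n".b)
  f.flush
  rows = read_latin1_csv(f.path)
  p rows
  p rows.first['name'].encoding
end
```

With only 'ISO-8859-1', the fields stay tagged as ISO-8859-1, which can later clash with UTF-8 strings elsewhere in a Rails app.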
Solved it with a different approach. This is a much easier solution for importing CSV files into a Rails 3 model than using an external gem:
require 'csv'
CSV.foreach('doc/socios_full.csv') do |row|
  record = Associate.new(
    :media_format => row[0],
    :group => row[0],
    :member => row[1],
    :family_relationship_code => row[2],
    :family_relationship_description => row[3],
    :last_name => row[4],
    :names => row[5],
  )
end
It works flawlessly, even with non-english characters (just tried a 75k import file!). Hope it's helpful for someone.
Maybe, you can try something like this:
The following approach should work in any model assuming you are confident that the CSV will contain the correct header names:
def self.import(file)
  CSV.foreach(file.path, headers: true) do |row|
    obj = new
    obj.attributes.each_key do |attribute|
      index = row.headers.index(attribute)
      obj.send("#{attribute}=",row[index]) if index
    end
  end
end

How to change the encoding during CSV parsing in Rails

I would like to know how I can change the encoding of my CSV file when I import and parse it. I have this code:
csv = CSV.parse(output, :headers => true, :col_sep => ";")
csv.each do |row|
  row = row.to_hash.with_indifferent_access
end
When I read my file, I get this error:
Encoding::CompatibilityError in FileImportingController#load_file
incompatible character encodings: ASCII-8BIT and UTF-8
I read about row.force_encoding('utf-8') but it does not work:
NoMethodError in FileImportingController#load_file
undefined method `force_encoding' for #<ActiveSupport::HashWithIndifferentAccess:0x2905ad0>
I had to read CSV files encoded in ISO-8859-1.
Doing the documented
CSV.foreach(filename, encoding:'iso-8859-1:utf-8', col_sep: ';', headers: true) do |row|
threw the exception
ArgumentError: invalid byte sequence in UTF-8
from csv.rb:2027:in '=~'
from csv.rb:2027:in 'init_separators'
from csv.rb:1570:in 'initialize'
from csv.rb:1335:in 'new'
from csv.rb:1335:in 'open'
from csv.rb:1201:in 'foreach'
so I ended up reading the file and converting it to UTF-8 while reading, then parsing the string:
CSV.parse(File.open(filename, 'r:iso-8859-1:utf-8'){|f| f.read}, col_sep: ';', headers: true, header_converters: :symbol) do |row|
  pp row
end
force_encoding is meant to be run on a string, but it looks like you're calling it on a hash. You could say:
output.force_encoding('utf-8')
csv = CSV.parse(output, :headers => true, :col_sep => ";")
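Note that force_encoding only relabels the bytes, so it helps only when the data really is in the encoding you name. The following sketch contrasts relabeling with transcoding; the sample bytes are made up, and ISO-8859-1 as the true source encoding is an assumption for the example.

```ruby
# "café" with é as the single ISO-8859-1 byte 0xE9, tagged as binary.
raw = "caf\xE9".b

# force_encoding changes the label only; the bytes stay the same,
# so the result is not valid UTF-8.
relabeled = raw.dup.force_encoding('UTF-8')
puts relabeled.valid_encoding?

# Label the bytes with their real encoding, then transcode to UTF-8.
converted = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts converted.valid_encoding?
puts converted.bytes.length
```

The transcoded string is one byte longer than the input because é takes two bytes in UTF-8.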
Hey I wrote a little blog post about what I did, but it's slightly more verbose than what's already been posted. For whatever reason, I couldn't get those solutions to work and this did.
The gist is that I simply replace (or in my case, remove) the invalid/undefined characters in my file, then rewrite it. I used this method to convert the files:
def convert_to_utf8_encoding(original_file)
  original_string = original_file.read
  final_string = original_string.encode(invalid: :replace, undef: :replace, replace: '') #If you'd rather invalid characters be replaced with something else, do so here.
  final_file = Tempfile.new('import') #No need to save a real File
  final_file.write(final_string)
  final_file.close #Don't forget me
  final_file
end
Hope this helps.
Edit: No destination encoding is specified here because encode assumes that you're encoding to your default encoding which for most Rails applications is UTF-8 (I believe)
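On Ruby 2.1 and later, String#scrub does the same kind of cleanup in one call. A minimal sketch, with clean_utf8 as a hypothetical helper name; note that this discards the offending bytes, so it suits cases where losing those characters is acceptable.

```ruby
# Removes (or replaces) byte sequences that are invalid in UTF-8.
# clean_utf8 is a hypothetical helper name.
def clean_utf8(text, replacement = '')
  text.dup.force_encoding('UTF-8').scrub(replacement)
end

# ISO-8859-1 bytes for ö and é, which are invalid as UTF-8.
dirty = "N\xF6t R\xE9al"

puts clean_utf8(dirty)
puts clean_utf8(dirty, '?')
```

If the source encoding is known, transcoding from it keeps the accented characters instead of dropping them.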
