Problem with attachments' character encoding using gmail gem in ruby/rails

What I am doing:
I am using the gmail gem in a Rails 4 app to get email attachments from a specific account at regular intervals. Here is an extract from the core part (here for simplicity only considering the first email and its first attachment):
require 'gmail'

Gmail.connect(@user_email, @user_password) do |gmail|
  if gmail.logged_in?
    emails = gmail.inbox.emails(:from => @sender_email)
    email = emails[0]
    attachment = email.message.attachments[0]
    # "~" is not expanded by File.open, so expand the path explicitly
    File.open(File.expand_path("~/temp.csv"), 'w') do |file|
      file.write(
        StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
      )
    end
  end
end
The encoding of the attached file can vary. The particular one that I am currently having issues with is in Finnish. It contains Finnish characters and a superscript 3 character.
This is what I expect to get when I run the above code. (This is what I get when I download the attachment manually through the Gmail user interface):
What the problem is:
However, I am getting the following odd results.
From cat temp.csv (Looks good to me):
With nano temp.csv (Here I have no idea what I am looking at):
This is what temp.csv looks like opened in Sublime Text (directly via winscp). First line and small parts look ok but then Chinese/Japanese characters:
This is what temp.csv looks like in Notepad (after download via winscp). Looks ok except that a blank space has been inserted between each character and the newlines seem to be missing:
What I have tried:
I have tried the following without success:
.force_encoding(...) with all the different "ISO-8859-x" character sets
putting the force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (works but doesn't solve the problem)
encode to UTF-8 without first forcing another encoding but this leads to Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
writing as binary with 'wb' and 'w+b' in the File.open() (which oddly doesn't seem to make a difference to the outcome).
searching stackoverflow and the web for other ideas.
Any ideas would be much appreciated!

Not beautiful, but it will work for me now.
After re-encoding, I convert the string to a char array, then remove the chars I do not want and then join the remaining array elements to form a string.
decoded_att = attachment.decoded
# Treat the bytes as ISO-8859-1, re-encode as UTF-8 and normalize line endings
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")
# Remove the unwanted characters: NUL, "ÿ" and "þ"
data = data.chars.reject { |c| c == "\u0000" || c == "ÿ" || c == "þ" }.join
# "~" is not expanded by File.write, so expand the path explicitly
File.write(File.expand_path("~/temp.csv"), data)
This will work for me for now. However, I have no idea how these characters ended up in the attachment ("ÿ" and "þ" at the start of the document and "\u0000" between all the remaining characters).
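The symptoms described here (a leading "ÿþ" and a "\u0000" after every character) are consistent with UTF-16LE text that starts with a byte order mark: read as ISO-8859-1, the BOM bytes FF FE show up as "ÿþ", and the second byte of every ASCII character is 00. Assuming that is what the attachment contains, a minimal sketch of decoding it directly might look like this. The helper name to_utf8 and the ISO-8859-15 fallback are illustrative assumptions, not part of the gmail gem.

```ruby
# Hypothetical helper: convert raw attachment bytes to UTF-8.
# Assumes the data is either UTF-16 with a BOM or ISO-8859-15.
def to_utf8(raw)
  bytes = raw.dup.force_encoding(Encoding::BINARY)
  if bytes.start_with?("\xFF\xFE".b)
    # UTF-16 little endian: drop the 2-byte BOM, then transcode
    bytes.byteslice(2..-1).force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  elsif bytes.start_with?("\xFE\xFF".b)
    # UTF-16 big endian
    bytes.byteslice(2..-1).force_encoding(Encoding::UTF_16BE).encode(Encoding::UTF_8)
  else
    # Assumed fallback for files without a UTF-16 BOM
    bytes.force_encoding(Encoding::ISO_8859_15).encode(Encoding::UTF_8)
  end
end

# Simulated attachment: BOM followed by UTF-16LE text
sample = "\xFF\xFE".b + "\u00E4\u00B3\n".encode(Encoding::UTF_16LE).b
puts to_utf8(sample)
```

With the question's code, this would be called on attachment.decoded before writing the file.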

It seems like you need to do attachment.body.decoded instead of attachment.decoded

Related

How to correctly handle character encoding when using Postgresql's copy_data function?

In my Rails app, I managed to stream large CSV files directly from Postgres based on solutions mentioned in this SO post. My working code looks somewhat like so:
query = <A Long SQL Query String>
response.headers["Cache-Control"] = "no-cache"
response.headers["Content-Type"] = "text/csv; charset=utf-8"
response.headers["Content-Disposition"] =
  %(attachment; filename="#{csv_filename}")
response.headers["Last-Modified"] = Time.now.ctime.to_s
conn = ActiveRecord::Base.connection.raw_connection
conn.copy_data("COPY (#{query}) TO STDOUT WITH (FORMAT CSV, HEADER TRUE, FORCE_QUOTE *, ESCAPE E'\\\\');") do
  while row = conn.get_copy_data
    response.stream.write row
  end
end
response.stream.close
end
Some of the columns (VARCHAR) being queried have values as either English or Chinese strings. The CSV file resulting from the above code doesn’t show the Chinese characters as is. Instead, I get something like this:
大大 文文
Am I supposed to change the way I’m using the copy_data function, or is there something I could do to the CSV file to solve this? I’ve tried saving the file as UTF-8 .txt file, as well as trying the convert_to function mentioned in the copy_data documentation, but to no avail.
This depends on the original encoding of the CSV file.
On Linux, run:
file -i your_file
Are you sure it's not UTF-16 or GB 18030?
Also, which encoding is your database set up with?
Run \l in psql to see this.
So it boiled down to my MS Excel not being able to render the Chinese chars correctly. On MacOS, opening the same .csv file using the Numbers app (or even Atom, for that matter) resolved this issue for me.
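Excel commonly falls back to a legacy code page when a CSV file carries no byte order mark, so a workaround that is often suggested is to emit a UTF-8 BOM before the first row. A minimal sketch, assuming the rows are already UTF-8 strings; the helper name write_csv_with_bom is hypothetical, and a StringIO stands in for response.stream:

```ruby
require "stringio"

# U+FEFF encoded as UTF-8 is the three bytes EF BB BF
UTF8_BOM = "\uFEFF"

# Hypothetical helper: write the BOM once, then every row unchanged.
def write_csv_with_bom(io, rows)
  io.write(UTF8_BOM)
  rows.each { |row| io.write(row) }
  io
end

# "\u5927\u6587" is two Chinese characters written as escapes
out = write_csv_with_bom(
  StringIO.new(String.new(encoding: Encoding::UTF_8)),
  ["name\n", "\u5927\u6587\n"]
)
puts out.string.bytes.first(3).map { |b| format("%02X", b) }.join(" ")
```

In the streaming code from the question, the equivalent would be a single response.stream.write of the BOM before the copy_data loop.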

RoR: Handling Blanks AND/OR Special Characters

I'm processing emails for upload and occasionally an embedded image in the email comes through either without a file extension or with an extension containing a random combination of letters, numbers and special characters (for example: image001.gif@01CFA02B.47556390). If either case occurs, I want to ignore the image and move on. I think I've got the missing extension covered, but I wasn't clear on how best to handle the random characters, or on the cleanest way to write the conditionals. Here is what I have so far:
filename_extension = File.extname(filename)
if filename_extension.blank?
  puts "FILENAME EXT IS BLANK"
elsif filename_extension #NEED REGEX or something to handle Random?
  puts "FILENAME EXT IS Random"
else #DO PROCESSING
end
Thanks.
known_extensions = %w[.csv .rb .rbw .html .htm .css]
filenames = %w[1.txt 2.csv 3]
filenames.each do |filename|
  filename_extension = File.extname(filename)
  if filename_extension.empty?
    puts "FILENAME EXT IS BLANK"
  elsif !known_extensions.include?(filename_extension)
    # Anything that is not in the list of known extensions counts as random
    puts "FILENAME EXT IS Random"
  else #DO PROCESSING
    puts "Processing"
  end
end
The question was tagged ruby without any indication of gems that may give you the blank? method.
The idea of an 'invalid' extension is rather varied, and of course tied to what it means to be a valid file name. On most Unix file systems, for example, the only limitations on a file name are the maximum length of 255 bytes and the reserved characters / and null. In fact, there is no specification that I am aware of for 'extensions' in Unix: they are simply part of the file name, and a period in a file name is valid without signifying anything special (with the exception of a file name that starts with a period, which indicates that it should be a 'hidden' file). On a Windows system, there is a longer list of forbidden characters, some of which are / : < > ? \ | + , . ; = [ ] as well as the single and double quotes. On my Commodore, I think it was :, , and =, and on my Amiga system I could use anything except for ", :, or /.
So I think an 'invalid' extension might be easier to match than a 'valid' one. If you are indeed using Rails and hosting on Unix, then you have a smaller set of things to check for to ensure a valid extension (indeed, a valid filename). Base the definition of an invalid extension on your hosting system and on any restrictions that follow from what a valid extension means to your program.
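As a rough illustration of the allow-list idea applied to the filename from the question, here is a minimal sketch. The list of image extensions and the helper name processable? are assumptions made up for this example. File.extname returns everything from the last period onwards, so the trailing ".47556390" is what gets compared against the list.

```ruby
# Hypothetical allow-list; adjust to whatever the application accepts.
ALLOWED_EXTENSIONS = %w[.gif .jpg .jpeg .png].freeze

# Hypothetical helper: true only when the extension is present and allowed.
def processable?(filename)
  ext = File.extname(filename).downcase
  !ext.empty? && ALLOWED_EXTENSIONS.include?(ext)
end

%w[image001.gif image001.gif@01CFA02B.47556390 image001].each do |name|
  puts "#{name}: #{processable?(name) ? 'process' : 'skip'}"
end
```

A single allow-list check covers both the blank and the random case, which keeps the conditional down to one branch.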

Is there a way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?

I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files are claimed to be UTF-8 encoded, but they clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:
tr -cd '^[:print:]' < original.xml > clean.xml
Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Ruby on Rails.
The problem is that we're deploying on Heroku, and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Ruby on Rails based solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:
data_source = ARGV[0]
data_file = open data_source
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
doc.xpath(".//job").each do |node|
  hash = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /,"_").strip.downcase.to_sym] = e.content
    data.push(newrow)
  end
end
Running this on the raw file produces an error:
"Invalid byte sequence in UTF-8"
Here are all the helpful suggestions I've tried but all have failed.
Use Coder
Coder.clean!(data_string, "UTF-8")
Force Encoding
data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
Convert to UTF-16 and back to UTF-8
data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
data_string.encode!('UTF-8', 'UTF-16')
Use valid_encoding?
data_string.chars.select{|i| i.valid_encoding?}.join
No characters are removed; generates "invalid byte sequence" errors.
Specify encoding on opening the file
I actually wrote a function that tries every encoding possible until it can open the file without errors and convert to UTF-8 (@file_encodings is an array of every possible file encoding):
@file_encodings.each do |enc|
  print "#{enc}..."
  conv_str = "r:#{enc}:utf-8"
  begin
    data_file = File.open(fname, conv_str)
    data_string = data_file.read
  rescue
    data_file = nil
    data_string = ""
  end
  data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")
  unless data_string.blank?
    print "\n#{enc} detected!\n"
    return data_string
  end
end
Use Regexp to remove non-printables:
data_string.gsub!(/[^[:print:]]/,"")
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")
(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)
For all of the above, the results are the same... either "invalid byte sequence" errors occur or the file is cut off halfway through after reading only 4400 rows.
So, why does the Linux "tr" command work perfectly and yet NONE of these suggestions can do the job in Ruby on Rails?
What I ended up doing is extremely inelegant, but it gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):
data_string.gsub!(/[Crazy Control Characters]/,"")
But the purist in me insists there should be a more elegant, general solution.
Ruby 2.1 has a new method called String#scrub, which is exactly what you need.
If the string is invalid byte sequence then replace invalid bytes with
given replacement character, else returns self. If block is given,
replace invalid bytes with returned value of the block.
Check the documentation for more information.
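To make the three forms of scrub concrete, here is a small sketch on a made-up string that contains one invalid byte (\xE9, a Latin-1 "é" that is not valid UTF-8 on its own). The sample data and variable names are illustrative only.

```ruby
# Build a UTF-8 tagged string that contains an invalid byte sequence
dirty = "caf\xE9 ok".b.force_encoding(Encoding::UTF_8)
puts dirty.valid_encoding?        # false

# Default: invalid bytes become U+FFFD (the replacement character)
replaced = dirty.scrub

# With an argument: invalid bytes become the given string (here, nothing)
removed = dirty.scrub("")

# With a block: the block receives the invalid bytes and returns the replacement
tagged = dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts removed                      # caf ok
puts tagged                       # caf<e9> ok
```

scrub returns a new string; scrub! modifies the receiver in place.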
I found this on Stack Overflow for some other question and this too worked fine for me. Assuming data_string is your XML:
data_string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
Thanks for the responses. I did find something that works by testing all sorts of combinations of different tools. I hope this is helpful to other people who have shared the same frustration.
data_string.encode!("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "" )
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")
As you can see, it's a combination of the "encode" method and a regexp to remove control characters (except for newlines).
My testing revealed that the file I was importing had TWO problems: (1) invalid UTF-8 byte sequences; and (2) unprintable control characters that forced Nokogiri to stop parsing before the end of the file. I had to fix both problems, in that order, otherwise gsub! throws the "invalid byte sequence" error.
Note that the first line in the code above could be substituted with EITHER of the following with the same successful result:
Coder.clean!(data_string,'UTF-8')
or
data_string.scrub!("")
This worked perfectly for me.
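The two steps described in this answer can be wrapped in one small function. This is only a sketch under the assumption that the input is meant to be UTF-8; the name clean_xml_text is made up for the example. The order matters for the reason given above: the regexp step raises on a string that still contains invalid bytes.

```ruby
# Hypothetical helper combining both fixes, in the required order.
def clean_xml_text(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  # Step 1: drop byte sequences that are not valid UTF-8
  text = text.scrub("")
  # Step 2: drop control characters except \n and \r
  # (note that this also removes tabs)
  text.gsub(/[[:cntrl:]&&[^\n\r]]/, "")
end

# Sample input with one invalid byte (\xFF) and one NUL character
puts clean_xml_text("a\xFFb\x00c\n".b).inspect
```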
Try using a combination of force_encoding("ISO-8859-1") and encode("utf-8"):
data_string.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
This helped me once.

rails4 / emails and special characters

In one of my app setups, everything works fine in the body of the email.
There is a problem with the subject of the email when it contains special characters,
which would, for example, output the following in the mailboxes:
a été validée !
Searching around, I ended up with the following, which works:
s = ("...a été validée !...").encode!("ISO-8859-15")
m = mail(to: email, subject: s)
But I guess there is a setup option that would make everything work properly.
Does anyone have any experience with this?
Can you try this in your application.rb?
config.action_mailer.default_charset = "iso-8859-15"
I'm using the default (utf-8) and accents work just fine, so you may want to see if there's something else that's botching up your data.
UPDATE: since this is only an issue with your subject, I think you may be able to fix this by adding this to the top of your source file that contains that subject line:
# encoding: UTF-8
Please try that out and let me know.

MalformedCSVError with rails CSV (FasterCSV)

I'm having serious issues trying to parse some CSV in rails right now.
Basically my app gets a user to upload a CSV file. The app then converts the file to ensure it is in UTF-8 format, then attempts to parse it and process it. Whenever the app attempts to parse it however, I get the MalformedCSVError stating "Illegal quoting on line 1"
Now what I don't get, is if I copy the original file into a new document and save it, then I can parse it on a rails console without a problem.
If I attempt to parse the original file, it complains about an invalid character for UTF-8 encoding (the file isn't in UTF-8 hence the app converts it)
If I attempt to parse the file which the app has converted to UTF-8 and changed the line endings to LF, it fails to parse.
If I do a file diff between the version the app has produced, and the copy/paste version that I have made (which works) there are 0 differences so I really can't figure out why one is parsable, and one is not.
Any suggestions? My app is processing the file as follows :
def create
  @survey = Survey.new(params[:survey])
  # Now we need to try and convert this to UTF-8 if it isn't already
  encoded = File.read(@survey.survey_data.current_path)
  encoding = CharlockHolmes::EncodingDetector.detect(encoded)
  # We've got a guess at the encoding,
  # so we can try and convert it but it
  # may still fail so we need to handle
  # that
  begin
    re_encoded = CharlockHolmes::Converter.convert(encoded, encoding[:encoding], 'UTF-8')
    re_encoded = re_encoded.gsub(/\r\n?/, "\n")
    # Now replace the uploaded file
    File.open(@survey.survey_data.current_path, 'w') { |f|
      f.write(re_encoded)
    }
  rescue ArgumentError
    puts "UH OH!!!!!"
  end
  puts "#{@survey.survey_data.current_path}"
  @parsed = CSV.read(@survey.survey_data.current_path)
end
The file uploading gem is CarrierWave if that makes any difference.
Please can someone help me as this is driving me insane!
Edit
The error says it's on line 1. Line 1 (assuming it doesn't index from 0) is
"Survey","RD","GarrysMDs","NigelsMDs","PaulsMDs","StephensMDs","BrinleyJ","CarolineP","DaveL","GrantR","GregS","Kent","NeilC","NicolaP","AndyC","DarrenS","DeanB","KarenF","PaulR","RichardF","SteveG","BrianG","GordonA","NickD","NickR","NickT","RayL","SimonH","EdmondH","JasonF","MikeS","SamanthaN","TimB","TravisF","AlanS","Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8PM","Q8N","Q9","Q10","Q11","Q12","Q13","Q14","Q15","Q16PM","Q16N","Q17PM","Q17N","Q18PM","Q18N","Q19","Q20","Q21","Q22","comment","Q23.1","Q23.2","Q23.3","TQ23.1","TQ23.2","VPM","VN","VQ1","VQ2","VQ3","VQ4","VQ5","VQ6","VQ7","VQ8N","VQ8PM","VQ9","VQ10","VQ11","VQ12","VQ13","VQ14","VQ15","VQ16","VQ16N","VQ16PM","VQ17","VQ17N","VQ17PM","VQ18","VQ18N","VQ18PM","VQ19","VQ20","VQ21","VQ22","VQ23.1","VQ23.2","VQ23.3","VRD","XQ16","XQ17","XQ18"
Well that was irritating!
Turns out the file had a BOM which was causing the CSV parser to break. Loading the file with
CSV.open("path/to/file.csv", "rb:bom|encoding")
allowed it to parse it perfectly! So annoyed how long it took to track down but it's now working and with no need to convert to UTF-8 now either!
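For anyone who wants to reproduce this, here is a small self-contained sketch. It writes a temporary CSV file that starts with a UTF-8 BOM and then reads it back with the "BOM|UTF-8" open mode, which strips the BOM before the data reaches the parser. The file contents are made up, and the example assumes the file is UTF-8; for other encodings the part after the "|" would differ.

```ruby
require "csv"
require "tempfile"

# Create a sample file: UTF-8 BOM followed by quoted CSV rows
file = Tempfile.new(["survey", ".csv"])
file.binmode
file.write("\xEF\xBB\xBF".b + %("Survey","RD"\n"1","2"\n).b)
file.close

# "BOM|UTF-8" tells Ruby to skip a leading BOM if one is present
rows = File.open(file.path, "r:BOM|UTF-8") { |f| CSV.parse(f.read) }
puts rows.inspect
```

Because the first field is quoted, a BOM left in front of the opening quote is what makes the parser report illegal quoting on line 1.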
