Consistent Encoding for iCal file import - ruby-on-rails

I'm trying to use the iCalendar gem to import some iCal files on a Rails 4 site.
Sometimes the file is of type 'text/calendar;charset=utf-8' and sometimes it's 'text/calendar; charset=UTF-8;'.
I am retrieving it like this:
require 'net/http'
require 'icalendar'

uri = URI.parse(url)
calendar = Net::HTTP.get_response(uri)
new_calendar = Icalendar.parse(calendar.body)
When it's text/calendar;charset=utf-8 it works fine, but when it's text/calendar; charset=UTF-8 I get raw UTF-8 byte escapes in the string:
SUMMARY:Tech Job Fair – City(ST) – Jul 1, 2015
ends up being
["Tech Job Fair \xE2\x80\x93 City(ST) \xE2\x80\x93 Jul 1", " 2015"]
That then gets saved to the database, which is undesirable.
Is the charset/content-type revealing the problem here, or could it actually just be encoded wrong at the source?
How do I change my retrieval code to strip those codes out effectively, or tell it it's a UTF-8 string so it doesn't include them in the first place?
Update: It looks like some are text/calendar;charset=utf-8, some are text/calendar;charset=UTF-8, and some are text/calendar; charset=UTF-8. Note the last one has a space between the two segments. Could this be causing an issue?
Update2: Opening up my three example iCal files in Notepad++ shows them encoded as "UTF-8 without BOM" in the menu.
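One common workaround, sketched below on the assumption that the body really is valid UTF-8 that Net::HTTP has merely labeled as binary, is to force the encoding on the response body before parsing:

body = calendar.body
# Net::HTTP often returns the body tagged as ASCII-8BIT regardless of the
# charset parameter in the Content-Type header; retagging it as UTF-8 makes
# Ruby treat the bytes as characters instead of escapes like \xE2\x80\x93.
body.force_encoding('UTF-8') if body.encoding == Encoding::ASCII_8BIT
new_calendar = Icalendar.parse(body)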

Related

Problem with attachments' character encoding using gmail gem in ruby/rails

What I am doing:
I am using the gmail gem in a Rails 4 app to get email attachments from a specific account at regular intervals. Here is an extract from the core part (for simplicity, only considering the first email and its first attachment):
require 'gmail'
require 'stringio'

Gmail.connect(@user_email, @user_password) do |gmail|
  if gmail.logged_in?
    emails = gmail.inbox.emails(:from => @sender_email)
    email = emails[0]
    attachment = email.message.attachments[0]
    File.open("~/temp.csv", 'w') do |file|
      file.write(
        StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
      )
    end
  end
end
The encoding of the attached file can vary. The particular one that I am currently having issues with is in Finnish. It contains Finnish characters and a superscript three (³) character.
This is what I expect to get when I run the above code (it is also what I get when I download the attachment manually through the Gmail user interface): [screenshot]
What the problem is:
However, I am getting the following odd results.
From cat temp.csv (looks good to me): [screenshot]
With nano temp.csv (here I have no idea what I am looking at): [screenshot]
This is what temp.csv looks like opened in Sublime Text (directly via WinSCP). The first line and small parts look OK, but then Chinese/Japanese characters: [screenshot]
This is what temp.csv looks like in Notepad (after download via WinSCP). It looks OK except that a blank space has been inserted between each character and the newlines seem to be missing: [screenshot]
What I have tried:
I have tried the following, without success:
.force_encoding(...) with all the different "ISO-8859-x" character sets
putting the force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (runs, but doesn't solve the problem)
encoding to UTF-8 without first forcing another encoding, but this leads to Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
writing as binary with 'wb' and 'w+b' in the File.open() (which oddly doesn't seem to make a difference to the outcome)
searching Stack Overflow and the web for other ideas.
Any ideas would be much appreciated!
Not beautiful, but it will work for me now.
After re-encoding, I convert the string to a char array, then remove the chars I do not want and then join the remaining array elements to form a string.
decoded_att = attachment.decoded
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")
data_as_array = data.chars
data_as_array = data_as_array.delete_if { |i| i == "\u0000" || i == "ÿ" || i == "þ" }
data = data_as_array.join
File.write("~/temp.csv", data)
This will work for me now. However, I have no idea how these characters ended up in the attachment ("ÿ" and "þ" at the start of the document and "\u0000" between all remaining characters).
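Those stray characters strongly suggest the attachment is actually UTF-16LE: "ÿþ" is the byte pair FF FE (the UTF-16LE byte-order mark) viewed through a single-byte encoding, and the NULs are the high bytes of two-byte UTF-16 code units. A minimal sketch of decoding it on that assumption, rather than stripping characters after the fact:

raw = attachment.decoded
# Reinterpret the raw bytes as UTF-16LE, transcode to UTF-8,
# drop the byte-order mark, and normalize line endings.
data = raw.force_encoding('UTF-16LE').encode('UTF-8')
data = data.sub(/\A\uFEFF/, '').gsub("\r\n", "\n")
File.write("temp.csv", data)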
It seems like you need to do attachment.body.decoded instead of attachment.decoded
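For reference, a minimal illustration of that suggestion (assuming attachment is a Mail::Part from the mail gem, where body.decoded undoes only the transfer encoding, e.g. base64, and returns the raw bytes):

# No charset conversion is applied here; you get the part's raw bytes back.
raw_bytes = attachment.body.decoded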

How to correctly handle character encoding when using Postgresql's copy_data function?

In my Rails app, I managed to stream large CSV files directly from Postgres based on solutions mentioned in this SO post. My working code looks something like this:
# Inside an ActionController::Live action:
query = <A Long SQL Query String>

response.headers["Cache-Control"] = "no-cache"
response.headers["Content-Type"] = "text/csv; charset=utf-8"
response.headers["Content-Disposition"] =
  %(attachment; filename="#{csv_filename}")
response.headers["Last-Modified"] = Time.now.ctime.to_s

conn = ActiveRecord::Base.connection.raw_connection
conn.copy_data("COPY (#{query}) TO STDOUT WITH (FORMAT CSV, HEADER TRUE, FORCE_QUOTE *, ESCAPE E'\\\\');") do
  while row = conn.get_copy_data
    response.stream.write row
  end
end
response.stream.close
Some of the columns (VARCHAR) being queried have values as either English or Chinese strings. The CSV file resulting from the above code doesn’t show the Chinese characters as is. Instead, I get something like this:
大大 文文
Am I supposed to change the way I'm using the copy_data function, or is there something I could do to the CSV file to solve this? I've tried saving the file as a UTF-8 .txt file, as well as the convert_to function mentioned in the copy_data documentation, but to no avail.
This depends on the original encoding of the CSV file.
On Linux, check it with:
file -i your_file
Are you sure it's not UTF-16 or GB 18030?
Also, what encoding is your database set up with?
Run \l in psql to see it.
So it boiled down to my MS Excel not being able to render the Chinese characters correctly. On macOS, opening the same .csv file in the Numbers app (or even Atom, for that matter) resolved the issue for me.
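If the file itself is valid UTF-8 and only Excel misreads it, one common workaround (a sketch, not part of the original answer) is to emit a UTF-8 byte-order mark before streaming the rows; Excel on Windows uses the BOM to detect UTF-8 instead of assuming the local ANSI code page:

# Write the UTF-8 BOM (EF BB BF) before the first COPY row.
response.stream.write "\xEF\xBB\xBF"
conn.copy_data("COPY (#{query}) TO STDOUT WITH (FORMAT CSV, HEADER TRUE);") do
  while row = conn.get_copy_data
    response.stream.write row
  end
end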

How do I parse an Excel file that will give me data exactly as it appears visually?

I'm on Rails 5 (Ruby 2.4). I want to read an .xls doc and I would like to get the data into CSV format, just as it appears in the Excel file. Someone recommended I use Roo, and so I have
require 'roo'
require 'csv'

book = Roo::Spreadsheet.open(file_location)
sheet = book.sheet(0)
text = sheet.to_csv
arr_of_arrs = CSV.parse(text)
However, what is getting returned is not the same as what I see in the spreadsheet. For instance, a cell in the spreadsheet has
16:45.81
and when I get the CSV data from above, what is returned is
"0.011641319444444444"
How do I parse the Excel doc and get exactly what I see? I don't care if I use Roo to parse or not, just as long as I can get CSV data that represents what I see rather than some weird internal representation. For reference, the file I was parsing gives this when I run "file name_of_file.xls":
Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1252, Author: Dwight Schroot, Last Saved By: Dwight Schroot, Name of Creating Application: Microsoft Excel, Create Time/Date: Tue Sep 21 17:05:21 2010, Last Saved Time/Date: Wed Oct 13 16:52:14 2010, Security: 0
You need to save the custom formula in a text format on the .xls side. If you're opening the .xls file from the internet this won't work, but if you can manipulate the file, this will fix your problem. You can do this using the function =TEXT(A2, "mm:ss.0") (A2 is just the cell I'm using as an example).
book = ::Roo::Spreadsheet.open(file_location)
puts book.cell('B', 2)
=> '16:45.8'
If manipulating the file is not an option you could just pass a custom converter to CSV.new() and convert the decimal time back to the correct format you need.
require 'roo-xls'
require 'csv'

CSV::Converters[:time_parser] = lambda do |field, info|
  case info[:header].strip
  when "time"
    begin
      # 0.011641319444444444 * 24 hours * 3600 seconds = 1005.81
      parse_time = field.to_f * 24 * 3600
      # 1005.81.divmod(60) = [16.0, 45.809999999999999945]
      mm, ss = parse_time.divmod(60)
      # mm is a Float (16.0), so convert it to an integer for display;
      # returns "16:45.81"
      "#{mm.to_i}:#{ss.round(2)}"
    rescue
      field
    end
  else
    field
  end
end

book = ::Roo::Spreadsheet.open(file_location)
sheet = book.sheet(0)
csv = CSV.new(sheet.to_csv, headers: true, converters: [:time_parser]).map { |row| row.to_hash }
puts csv
=> {"time "=>"16:45.81"}
{"time "=>"12:46.0"}
Under the hood roo-xls gem uses the spreadsheet gem to parse the xls file. There was a similar issue to yours logged here, but it doesn't appear that there was any real resolution. Internally xls stores 16:45.81 as a Number and associates some formatting with it. I believe the issue has something to do with the spreadsheet gem not correctly handling the cell format.
I did try messing around with adding a format mm:ss.0 by following this guide, but I couldn't get it to work; maybe you'll have more luck.
You can use the converters option. It would look something like this:
arr_of_arrs = CSV.parse(text, {converters: :date_time})
http://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html
Your problem seems to be with the way you're parsing (reading) the input file.
roo parses only Excel 2007-2013 (.xlsx) files. From your question, you want to parse .xls, which is a different format.
Like the documentation says, use the roo-xls gem instead.
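For instance, a sketch of the Gemfile change this implies (gem names only; versions are up to you):

# Gemfile
gem 'roo'      # .xlsx, .ods, .csv
gem 'roo-xls'  # legacy .xls support (wraps the spreadsheet gem)

With both in place, Roo::Spreadsheet.open(file_location) can dispatch on the .xls extension.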

Indy is altering the binary data in my URL

I want to send some binary data over via GET using the Indy components.
So, I have a URL like www.awebsite.com/index.php?data=xxx where xxx is the binary data encoded using the ParamsEncode function. After encoding, the binary data is converted to something like bB7%18%11z\ so my URL is something like:
www.awebsite.com/bB7%18%11z\
I have seen that if my URL contains the backslash char (see the last char in the URL), it is replaced with the slash char (/) in TIdURI.NormalizePath, so my binary data is corrupted. What am I doing wrong?
Backslashes aren't allowed in URLs, and to avoid confusion between Windows and *nix systems, all backslashes are replaced by slashes to attempt to keep things working. See http://www.faqs.org/rfcs/rfc1738.html section 5, HTTP, httpurl.
You could try replacing backslashes with %5C yourself.
That said, you should either try MIME encoding, or get the hang of POST requests.
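For illustration, here is the percent-encoding idea sketched in Ruby (Ruby only because the exact Indy call varies by version; the input string is the example data from the question):

require 'cgi'
# A literal backslash percent-encodes to %5C, which survives URL
# normalization, unlike a raw backslash.
CGI.escape("bB7\x18\x11z\\")  # => "bB7%18%11z%5C"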
You're using an old version of Indy. Backslashes are included in the UnsafeChars list that Indy uses now. Remy changed it in July 2010 with revision 4272 in the Tiburon branch:
r4272 | Indy-RemyLebeau | 2010-07-07 03:12:23 -0500 (Wed, 07 Jul 2010) | 1 line
Internal logic changes for TIdURI, and moved some sharable logic into IdGlobalProtocols.pas for later use in TIdHTTP.
It was merged into the trunk with the rest of Indy 10.5.7 with revision 4394, in September 2010.

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
For example, this: סלקום
comes out as gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such an operation?
Does libiconv support such an operation?
I appreciate any help.
Thanks
Edit: OK, so what you're seeing is UTF-8 data being decoded as Windows-1255 (so the numeric character references were a red herring). Here's a demonstration in Python:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a header tag like <meta charset=utf-8> to indicate to the browser what its encoding should be. A document served by a web server contains an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)
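As an illustration of that approach (sketched in Ruby for brevity; in C you would pair curl_easy_getinfo(CURLINFO_CONTENT_TYPE) with iconv), read the charset from the Content-Type header, then transcode to your target encoding:

require 'net/http'

res = Net::HTTP.get_response(URI('http://example.com/page.html'))
# Pull the charset out of e.g. "text/html; charset=utf-8"; default to UTF-8.
charset = res['Content-Type'].to_s[/charset=([^;\s]+)/i, 1] || 'UTF-8'
# Tag the raw bytes with the source encoding, then transcode.
html = res.body.force_encoding(charset).encode('UTF-8')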
