Displaying ©, & and ® symbols in Excel with Ruby on Rails

I am exporting my data to an Excel file with the Spreadsheet gem and Ruby on Rails, and I want to add a header and footer to the file. The problem is that when I do this, the copyright, ampersand and registered symbols are not displayed: either it throws a multibyte character error or it simply displays nothing.
I have gone through all the similar problems and even tried the magic comments # encoding: utf-8 and # -*- coding: utf-8 -*-. It is of no use.
When I tried an escape sequence ("\u00A9", the Unicode code point for ©), the file format got corrupted. Any possible solutions for this problem? Am I missing something?
Kindly help.
Thanks in advance.

This code works for me:
def do_test
  book = Spreadsheet::Workbook.new
  sheet1 = book.create_worksheet
  sheet1[0, 0] = "\u00a9"
  book.write "./sample.xls"
end
It is possible that you may have set the spreadsheet encoding to something other than UTF-8 at some point. You can check Spreadsheet.client_encoding to see what is being used.
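For example, a minimal check (client_encoding and its setter are part of the gem's public API; "UTF-8" is the usual default):
require 'spreadsheet'

# See what encoding the gem currently expects for strings you pass in
puts Spreadsheet.client_encoding       # => "UTF-8" unless something changed it

# Reset it explicitly if another library altered it
Spreadsheet.client_encoding = 'UTF-8'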
UPDATE
The add_header/footer code is very encoding-specific. Here is the code the gem uses:
def write_header
  write_op opcode(:header), [@worksheet.header.bytesize, 0].pack("vC"), @worksheet.header
end
The Excel writer uses Unicode-1200 (UTF-16 little endian) by default. This may mean that you need to encode any non-ASCII characters with "\u00a9".encode('UTF-16LE') in order to get this to work...
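For instance, a sketch of that suggestion (hedged: this assumes your setup exposes a header accessor on the worksheet, as the question implies; the accessor name may differ in your version of the gem):
require 'spreadsheet'

book = Spreadsheet::Workbook.new
sheet = book.create_worksheet
# Hypothetical accessor: pre-encode the header text as UTF-16LE
sheet.header = "\u00a9 Example".encode('UTF-16LE')
book.write './with_header.xls'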

Related

Ignore � (non-UTF-8 characters) in email attachment or strip them from the attachment?

Users of our application are able to upload plain text files. These files might then be added as attachments to outgoing ActionMailer emails. Recently an attempt to send said email resulted in an invalid byte sequence in UTF-8 error. The email was not sent. This symbol, �, appears throughout the offending attachment.
We're using ActionMailer, so although it ought to go without saying, here's representative code for adding the attachment within the mailer class's method:
attachments['file-name.jpg'] = File.read('file-name.jpg')
From a business standpoint we don't care about the content of these text files. Ideally I'd like for our application to ignore the content and simply attach them to emails.
Is it possible to somehow tell Rails / ActionMailer to ignore the formatting? Or should I parse the incoming text file, stripping out non-UTF-8 characters?
I did search through like questions here on Stack Overflow but nothing addressed the problem I'm currently facing.
Edit: I did call #readlines on the file in a Rails console and found that the black diamond is a representation of \xA0. This is likely a non-breaking space in Latin1 (ISO 8859-1).
If Ruby is having problems reading the file and is corrupting the characters during the read, try using File.binread (which File inherits from IO) to read the raw bytes:
...
attachments['attachment.txt'] = File.binread('/path/to/file')
...
If your file already has corrupted characters, then you can either find some process to 'uncorrupt' them, which is not fun, or strip them by re-encoding from ASCII-8BIT to UTF-8 and replacing the invalid characters:
...
attachments['attachment.txt'] = File.binread('/path/to/file')
                                    .encode('utf-8', 'binary', invalid: :replace, undef: :replace)
...
(String#scrub does this, but since you can't read the file in as UTF-8, you can't use it here.)
With your edit, this seems pretty clear to me:
The file on your filesystem is encoded in latin1.
File.read uses Ruby's default external encoding. If LANG contains something like "en_GB.utf8", File.read will tag the string as UTF-8. You can verify this by logging the value of str.encoding (where str is the value returned by File.read).
File.read does not actually validate the encoding; it only slurps in the bytes and slaps the encoding on (like force_encoding).
Later, in ActionMailer, something wants to transcode the string, for whatever reason, and that fails as expected (and with the result you are noticing).
If your text files are encoded in latin1, then use File.read(path, encoding: Encoding::ISO_8859_1). This way, it may work. Let us know if it doesn't...
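For example, a small sketch of that suggestion (assuming the file really is Latin-1 on disk):
# Tag the bytes as ISO-8859-1 on read, then transcode to UTF-8 for the mailer
str = File.read('/path/to/file', encoding: Encoding::ISO_8859_1)
puts str.encoding             # => #<Encoding:ISO-8859-1>
utf8 = str.encode('UTF-8')    # \xA0 becomes a genuine non-breaking space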
When reading the file at time of attachment, I can use the following syntax.
mail.attachments[file.file_name.to_s] = File.read(path_to_file).force_encoding("BINARY").gsub(0xA0.chr,"")
The important addition is the following, which goes after the call to File.read(...):
.force_encoding("BINARY").gsub(0xA0.chr,"")
The stripping and encoding ought to be done at time of file upload to our system, so this answer isn't the resolution. It's a short-term band-aid.
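A sketch of what that longer-term fix might look like (the Upload model and its content attribute are hypothetical; the re-encoding mirrors the binread answer above):
class Upload < ActiveRecord::Base
  before_save :scrub_content

  # Drop any bytes that are not valid UTF-8 before the file is stored
  def scrub_content
    self.content = content.encode('UTF-8', 'BINARY',
                                  invalid: :replace, undef: :replace, replace: '')
  end
end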

Character encoding, how do I tell the difference?

Characters coming out of my database are encoded differently from the same characters written directly in the source. For example, the word Permissões gives a different result when the string is written directly in the HTML than when it is output from a db record.
# From the source
Addressable::URI.encode("Permissões.pdf") #=> "Permiss%C3%B5es.pdf"
# From the db
Addressable::URI.encode("Permissões.pdf") #=> "Permisso%CC%83es.pdf"
The encodings are different. But my database is set to UTF-8, and I am using HTML5. What could be causing this?
I am unable to download files I upload to S3 because of this issue. I tried to force the encoding with attachment.path.encode("UTF-8"), but that makes no difference.
The two strings are actually both valid UTF-8; they differ in Unicode normalization form: %C3%B5 is the precomposed character õ (NFC), while o%CC%83 is an "o" followed by a combining tilde (NFD). To solve this, since I am using Rails, I used ActiveSupport::Multibyte::Unicode to normalize any unicode characters before they get inserted into the database.
before_save do
  self.path = ActiveSupport::Multibyte::Unicode.normalize(path)
end
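To see the difference directly, here is a quick sketch (both strings print as "Permissões"; the escapes are standard Ruby):
nfc = "Permiss\u00F5es"    # precomposed o-tilde (NFC), like the source literal
nfd = "Permisso\u0303es"   # "o" plus combining tilde (NFD), like the db value
nfc == nfd                 # => false
ActiveSupport::Multibyte::Unicode.normalize(nfc) ==
  ActiveSupport::Multibyte::Unicode.normalize(nfd)   # => true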

Stubborn character encoding errors when reading strings from text file (Ruby/Rails)

I've been trying to import a long text file generated from a PDF reader application (SODA PDF); the source document is a script in PDF format.
The converted text files look OK in Notepad, but I get a variety of errors when I try to read the file into a string and manipulate it.
None of the following methods, which I've seen in various threads, works:
clean1=Iconv.conv('ASCII//IGNORE', 'UTF8', s)
or
clean1=s.encode('UTF-8', invalid: :replace, undef: :replace, replace: '', UNIVERSAL_NEWLINE_DECORATOR: true)
or
clean1=s.gsub(/[\u0080-\u00ff]/,"")
The first method, using Iconv, gives
Iconv::InvalidEncoding: invalid encoding ("ASCII", "UTF8")
when invoked.
The second method appears to work, but fails on various string manipulations like
lines= s.split("\n") unless s.blank?
with
ArgumentError: invalid byte sequence in UTF-8
(Either split or blank? will throw the exception.)
The 3rd method also fails with the 'invalid byte sequence in UTF-8' error.
I am quite hazy on the whole character encoding thing, so excuse any obvious stupidity here.
I'm going to try character-by-character filtering, but that's kind of a pain since the docs I am working with can be 100+ pages, and I'm hoping there's an easier solution.
Env: Win7 64/ ruby 1.9.3p484 (2013-11-22) [i386-mingw32] / Rails 4.0.3
I discovered that my source file was encoded in ISO-8859-1. I was able to convert it to UTF-8 and it all works fine now.
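For anyone hitting the same wall, a sketch of that fix (the filename is hypothetical):
# Tag the bytes with their real encoding on read, then transcode once
s = File.read('script.txt', encoding: 'ISO-8859-1')
clean1 = s.encode('UTF-8')
lines = clean1.split("\n")    # no more invalid-byte-sequence errors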

How to identify character encoding from website?

What I'm trying to do:
I'm getting a list of URIs from a database and downloading them,
removing the stopwords and counting the frequency with which words appear in each webpage,
then trying to save the results in MongoDB.
The Problem:
When I try to save the result in the database I get the error
bson.errors.invalidDocument: the document must be a valid utf-8
It appears to be related to the codes '\xc3someotherstrangewords' and '\xe2something'.
When I'm processing the webpages I try to remove the punctuation, but I can't remove the accents because that would produce wrong words.
What I have already tried:
I've tried to identify the character encoding from the webpage's headers;
I've tried using chardet;
I've tried re.compile(r"[^a-zA-Z]") and/or unicode(variable, 'ascii', 'ignore'),
but these aren't good for non-English languages because they remove the accents.
What I want to know is:
does anyone know how to identify the characters and translate them to the right word/encoding?
e.g. get '\xe2' from the webpage and translate it to 'â'
(English isn't my first language, so forgive me.)
EDIT: if anyone wants to see the source code
It is not easy to find out the correct character encoding of a website because the information in the header might be wrong. BeautifulSoup does a pretty good job at guessing the character encoding and automatically decodes it to Unicode.
from bs4 import BeautifulSoup
import urllib
url = 'http://www.google.de'
fh = urllib.urlopen(url)
html = fh.read()
soup = BeautifulSoup(html)
# text is a Unicode string
text = soup.body.get_text()
# encoded_text is a utf-8 string that you can store in mongo
encoded_text = text.encode('utf-8')
See also the answers to this question.

Character conversion in ruby 1.8.7 from pdftk unicode conversion results

I am parsing titles from PDF files using pdftk; the titles contain various language-specific characters.
The Ruby on Rails application I need to do this in uses ruby 1.8.7 and rails 2.3.14, so any encoding solutions built into ruby 1.9 aren't an option for me right now.
Example of what I need to do:
If the title includes a ü, when I read the pdf content using pdftk (either on the command line or via the ruby pdf-toolkit gem), the "ü" gets converted to the HTML entity &#252;.
In my application I really want the ü itself, as that seems to work fine for my needs in a web page and in an XML file.
I can convert the character explicitly in ruby using
>> string = "&#252;"
=> "&#252;"
>> string.gsub("&#252;", "ü")
=> "ü"
but obviously I don't want to do this one by one.
I've tried using Iconv to do this, but I feel I don't know what to specify to get this converted to the rendered character. I thought maybe this was just UTF-8, but it doesn't seem to convert to the rendered character:
>> Iconv.iconv("latin1", "utf-8", "&#252;").join
=> "&#252;"
I am a little confused about which to/from formats to use here to get the rendered character as the end result.
So how do I use Iconv or other tools to make this conversion for all the characters pdftk converts to these HTML codes?
Or how do I tell pdftk to do this conversion for me when I read the pdf file in the first place?
Ok - I think the issue here is that the codes pdftk returns are HTML entities, so unescaping the HTML first is the path that works:
>> Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(string)).join
=> "ü"
Update:
Using the following
pdf = PDF::Toolkit.open(file)
pdf.title = Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(pdf.title)).join
This seems to work for most languages, but when I apply it to Japanese and Chinese it mangles things and doesn't reproduce the original as it appears in the PDF.
Update:
Getting closer - it appears that the HTML codes pdftk puts in the title for Japanese and Chinese already render correctly if I just unescape them and don't attempt any Iconv conversion.
CGI.unescapeHTML(pdf.title)
This renders correctly.
So... how do I test pdf.title ahead of time to see whether it is Chinese or Japanese (double byte?) before I apply the conversion needed for other languages?
Maybe something like:
string.gsub(/&#\d+;/){|x| x[/\d+/].to_i.chr}
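That works for entities in the 0-255 range, but on ruby 1.8 Integer#chr raises for larger codepoints, so Japanese and Chinese titles would fail. A hedged variant that emits UTF-8 for any codepoint (Array#pack('U') is available on 1.8 as well):
def decode_entities(str)
  # Replace each &#NNNN; with the UTF-8 bytes for that codepoint
  str.gsub(/&#(\d+);/) { [$1.to_i].pack('U') }
end

decode_entities("&#252;")     # => "ü"
decode_entities("&#26085;")   # => "日"
With this, there is no need to detect Chinese or Japanese ahead of time; the same code path handles every codepoint.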
