Character conversion in Ruby 1.8.7 from pdftk Unicode conversion results - ruby-on-rails

I am parsing titles from PDF files using pdftk; the titles contain various language-specific characters.
The Ruby on Rails application I need to do this in uses Ruby 1.8.7 and Rails 2.3.14, so any encoding solutions built into Ruby 1.9 aren't an option for me right now.
Example of what I need to do:
If the title includes a ü, when I read the PDF content using pdftk (either on the command line or via the ruby pdf-toolkit gem), the "ü" gets converted to the HTML entity &#252;
In my application I really want this as the actual character ü, since that works fine for my needs in a web page and in an XML file.
I can convert the character explicitly in ruby using
>> string = "&#252;"
=> "&#252;"
>> string.gsub("&#252;","ü")
=> "ü"
but obviously I don't want to do this one by one.
I've tried using Iconv to do this, but I don't know what encodings to specify to get the entity converted to the rendered character. I thought maybe this was just UTF-8, but it doesn't convert to the rendered character:
>> Iconv.iconv("latin1", "utf-8", "&#252;").join
=> "&#252;"
I am a little confused about which to/from encodings to use here to get the rendered character as the end result.
So how do I use Iconv or other tools to make this conversion for all characters that pdftk turns into these HTML entities?
Or how do I tell pdftk to do this when I read the PDF file in the first place?

OK - I think the issue here is that the codes pdftk returns are HTML entities, so unescaping the HTML first is the path that works:
>> Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(string) ).join
=> "ü"
Update:
Using the following
pdf = PDF::Toolkit.open(file)
pdf.title = Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(pdf.title)).join
This seems to work for most languages, but when I apply it to Japanese and Chinese it mangles things and doesn't reproduce the original as it appears in the PDF.
Update:
Getting closer - it appears that the HTML entities pdftk puts in the title for Japanese and Chinese already render correctly if I just unescape them and don't attempt any Iconv conversion.
CGI.unescapeHTML(pdf.title)
This renders correctly.
So... how do I test pdf.title ahead of time to see whether it is Chinese or Japanese (multibyte?) before I apply the conversion needed for other languages?
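One way to answer that last question, as a sketch: since pdftk emits decimal HTML entities, the code points can be inspected before deciding which path to take. The helper name below is hypothetical; the assumption is that any entity above 255 (outside Latin-1) signals Chinese/Japanese or other multibyte text.

```ruby
# Hypothetical helper: true when the title contains a numeric HTML
# entity whose code point is above 255, i.e. outside Latin-1. Such
# titles should only be unescaped, not run through Iconv.
def multibyte_entities?(title)
  title.scan(/&#(\d+);/).any? { |(code)| code.to_i > 255 }
end
```

With that, CGI.unescapeHTML alone can be used when multibyte_entities? returns true, and the unescape-plus-Iconv round-trip otherwise.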

Maybe something like:
string.gsub(/&#\d+;/){|x| x[/\d+/].to_i.chr}
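The one-liner above relies on Integer#chr, which only covers single-byte code points (and yields binary strings on Ruby 1.9). A variant of the same idea, assuming the entities are decimal, is to use pack("U"), which emits UTF-8 for any code point on both 1.8 and 1.9, so CJK entities decode too:

```ruby
# Decode decimal HTML entities into UTF-8 characters. pack("U")
# handles any Unicode code point, unlike Integer#chr, which only
# covers single bytes.
def decode_entities(s)
  s.gsub(/&#(\d+);/) { [$1.to_i].pack("U") }
end
```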

Related

Ruby: Convert <br> to newline and URI-encode

I want to share some text on WhatsApp, so I'm converting HTML to text; otherwise it displays all the tags.
Currently I'm using strip_tags to remove the tags, but that also removes the breaks from the text. How do I convert HTML to text, convert breaks to newline characters, and URL-encode the text?
Currently I'm using the following:
@whatsapp_text = u strip_tags(@post.summary)
I suggest you use Nokogiri to solve this problem. Nokogiri can parse HTML and convert a page's source into human-readable text; although it doesn't convert HTML breaks to line breaks by itself, it can take many problems away from you. To do this, add the following line to your Gemfile:
gem 'nokogiri'
Run bundle install. Then you can solve your problem like this:
Nokogiri::HTML.parse(@post.summary.gsub("<br>", "\r\n").gsub("<br/>", "\r\n")).inner_text
That should do it for you.
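If you'd rather avoid the Nokogiri dependency, here is a stdlib-only sketch of the same idea (the regex tag strip is crude and will mangle malformed HTML, so treat it as an approximation); ERB::Util.url_encode handles the URL-encoding the question also asked about, and keeps spaces as %20 rather than +:

```ruby
require 'cgi'
require 'erb'

# Convert <br> variants to newlines, crudely strip remaining tags,
# unescape HTML entities, then URL-encode the result for a share link.
def whatsapp_text(html)
  text = html.gsub(/<br\s*\/?>/i, "\n").gsub(/<[^>]+>/, '')
  ERB::Util.url_encode(CGI.unescapeHTML(text))
end
```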

Stubborn character encoding errors when reading strings from text file (Ruby/Rails)

I've been trying to import a long text file generated from a PDF reader application (SODA PDF). The source document is a script in PDF format.
The converted text files look OK in Notepad, but I get a variety of errors when trying to read the file into a string and manipulate it.
None of the following methods, which I've seen in various threads, seems to work:
clean1=Iconv.conv('ASCII//IGNORE', 'UTF8', s)
or
clean1=s.encode('UTF-8', invalid: :replace, undef: :replace, replace: '', UNIVERSAL_NEWLINE_DECORATOR: true)
or
clean1=s.gsub(/[\u0080-\u00ff]/,"")
The first method, using Iconv gives
Iconv::InvalidEncoding: invalid encoding ("ASCII", "UTF8")
when invoked.
The second method appears to work, but fails on various string manipulations like
lines= s.split("\n") unless s.blank?
with
ArgumentError: invalid byte sequence in UTF-8
(Either split or blank? will throw the exception.)
The 3rd method also fails with the 'invalid byte sequence in UTF-8' error.
I am quite hazy on the whole character-encoding thing, so excuse any obvious stupidity here.
I'm going to try character-by-character filtering, but that's kind of a pain since the docs I'm working with can be 100+ pages, and I'm hoping there's an easier solution.
Env: Win7 64/ ruby 1.9.3p484 (2013-11-22) [i386-mingw32] / Rails 4.0.3
I discovered that my source file was encoded in ISO-8859-1. I was able to convert it to UTF-8, and it all works fine now.
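Spelling that fix out as a sketch for Ruby 1.9+ (which the question's environment has): declare the file's real encoding at read time, then transcode to UTF-8. The helper name is hypothetical.

```ruby
# Read bytes as ISO-8859-1 and transcode to UTF-8. After this,
# split and other string manipulations no longer raise
# "invalid byte sequence in UTF-8".
def read_latin1_as_utf8(path)
  File.read(path, :encoding => "ISO-8859-1").encode("UTF-8")
end
```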

Displaying ©, & symbol in excel with Ruby on Rails

I am exporting my data into an Excel file with the Spreadsheet gem and Ruby on Rails. I want to add a header and footer to my Excel file. The problem is that when I do this, the copyright symbol, ampersand symbol, and registered symbol are not displayed. Either it throws a multibyte character error or it simply displays nothing.
I have gone through all similar problems and even tried # encoding: utf-8 and # -*- coding: utf-8 -*-. It is of no use.
When I tried to use an escape sequence ("\u00A9", the Unicode escape for ©), the file format got corrupted. Any possible solutions for this problem? Am I missing something?
Kindly help.
Thanks in advance
This code works for me:
def do_test
  book = Spreadsheet::Workbook.new
  sheet1 = book.create_worksheet
  sheet1[0,0] = "\u00a9"
  book.write "./sample.xls"
end
It is possible that you may have set the spreadsheet encoding to something other than UTF-8 at some point. You can check Spreadsheet.client_encoding to see what is being used.
UPDATE
The add_header/footer code is very encoding specific. Here is the code used:
def write_header
  write_op opcode(:header), [@worksheet.header.bytesize, 0].pack("vC"), @worksheet.header
end
The Excel writer is using Unicode-1200 (UTF-16 little endian) by default. This may mean that you need to encode any non-standard characters using "\u00a9".encode('UTF-16LE') in order to get this to work...
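A sketch of that suggested workaround (the header text itself is a placeholder): pre-encode the header string to UTF-16LE so the © survives the bytesize-based length field in write_header above.

```ruby
# Pre-encode the header to UTF-16LE, the writer's default, so the
# copyright symbol survives; the header text is placeholder.
header_text = "\u00a9 Example Company".encode("UTF-16LE")
```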

Upgraded to Rails 3 and Ruby 1.9 and Unicode data in Postgres database now returning as ASCII (potential bug?)

I'm running into a really strange phenomenon after upgrading from Rails 2.3/Ruby 1.8 to Rails 3/Ruby 1.9. As I mentioned in the title, I'm using Postgres, along with the pg gem 0.10.0.
When I make a call to a model's string or text fields that contain Unicode, it works correctly, and they are returned with an encoding of UTF-8.
However, I also make use of serialized Hashes in a number of models, and whenever I make a call to read their contents (which worked perfectly prior to the upgrade), I get the following puzzling behavior:
If the contents contains Unicode data, it returns as ASCII, and is displayed as escaped characters.
If the contents contains ASCII data, it returns as UTF-8 (correctly), and is properly displayed.
I can simply re-encode the Unicode-returned-as-ASCII strings back to UTF-8, and everything will work fine. However, that is definitely a hack, and doesn't strike me as a good approach.
Is there a way to make serialized UTF-8 fields display correctly? If this is a bug somewhere, any idea where, and if it's known already?
Does this answer it? Why are all strings ASCII-8BIT after I upgraded to Rails 3?
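The gist of that linked question's usual remedy, as a sketch: the driver hands back bytes that are valid UTF-8 but tagged ASCII-8BIT, so you retag the string rather than transcode it (force_encoding changes the label, not the bytes). The helper name is hypothetical.

```ruby
# Retag an ASCII-8BIT string whose bytes are already valid UTF-8.
# force_encoding relabels the string without touching its bytes.
def retag_utf8(s)
  s.dup.force_encoding("UTF-8")
end
```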

Problem with cyrillic characters in Ruby on Rails

In my Rails app I work a lot with Cyrillic characters. That's no problem: I store them in the DB and I can display them in HTML.
But I have a problem exporting them to a plain txt file. A string like "элиас" becomes "—ç–ª–∏–∞—Å" if I let Rails put it in a txt file and download it. What's wrong here? What has to be done?
Regards,
Elias
Obviously, there's a problem with your encoding. Make sure your text is in Unicode before writing it to the text file. You may use something like this:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
your_unicode_text = ic.iconv(your_text + ' ')[0..-2]
Also, double check that your database encoding is UTF-8. Cyrillic characters can display fine in DB and in html with non-unicode encoding, e.g. KOI8-RU, but you're guaranteed to have problems with them elsewhere.
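For the file-export side specifically, a sketch for Ruby 1.9+ (the function name and path are placeholders): open the file with an explicit UTF-8 external encoding so the platform default can't reinterpret the bytes. The mojibake in the question is exactly what UTF-8 bytes look like when the file is later opened as Mac Roman.

```ruby
# Write text with an explicit UTF-8 external encoding so the bytes
# on disk are UTF-8 regardless of the platform default.
def export_text(path, text)
  File.open(path, "w:UTF-8") { |f| f.write(text) }
end
```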
