Special characters/Chinese text not rendering properly in CSV - ruby-on-rails

I am using charset=utf-8; in the CSV options. Chinese characters and some special characters (e.g. '»') are still not rendered correctly in the CSV, though the same text looks fine in the browser.

I am using r:ISO-8859-1 for the Asia-specific region.

Related

Auto detect language and display the correct one with javascript

I am making a website for my friend
https://photos4humanity.herokuapp.com/
I'm thinking of pulling the posts from its Facebook page and displaying them on the website so he doesn't have to duplicate content for both.
Each Facebook post has both English and Chinese in it, like here:
https://www.facebook.com/photosforhumanity/
I would like to auto-detect the language from the JSON file I get from Facebook: work out which part is in English and which is in Chinese, then display only the right language according to the Rails internationalization (I18n) locale.
Is there a smart way to do this?
You could use Regex to detect if the string has any English characters or not:
isEnglish = myString.match(/[a-zA-Z]/)
or
isEnglish = myString =~ /[a-zA-Z]/
I haven't tested either of these and I don't know how your JSON file is organized, but this should work for a single string.
Edit:
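To pull the English characters out of the string, note that slice! removes only the first match (a single letter with this regex), so use scan to collect the English words and gsub! to strip them:
englishString = myString.scan(/[a-zA-Z]+/).join(" ")
myString.gsub!(/[a-zA-Z]/, "")
After doing that, myString should contain only non-English characters and englishString should contain only the English words.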
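Since each post contains both languages, another option is to sort whole paragraphs by script instead of pulling out individual letters. The following is only a sketch under assumptions: it assumes the two languages sit in separate newline-separated paragraphs, and `split_by_language` is a hypothetical helper name, not part of Rails or the Facebook API.

```ruby
# Sort the paragraphs of a mixed post into Chinese and English buckets.
# Assumption: each newline-separated paragraph is written mostly in one language.
def split_by_language(text)
  groups = { :zh => [], :en => [] }
  text.split(/\n+/).each do |paragraph|
    paragraph = paragraph.strip
    next if paragraph.empty?
    han   = paragraph.scan(/\p{Han}/).length   # count of Chinese (Han) characters
    latin = paragraph.scan(/[A-Za-z]/).length  # count of ASCII letters
    # Paragraphs with neither script (e.g. a bare number) fall into :en.
    groups[han > latin ? :zh : :en] << paragraph
  end
  groups
end

post  = "Photos for humanity\n人道摄影"
parts = split_by_language(post)
puts parts[:en].join("\n")
puts parts[:zh].join("\n")
```

In a Rails view you could then pick the bucket that matches the current I18n locale, for example parts[:zh] when the locale is a Chinese one.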

objective-c, PDF, How to solve "failed to parse embedded CMap." issue in PDF searching?

I am trying to search for text in PDFs. My project works fine on most PDFs, but it fails to find text in some of them, and Xcode shows this message on the console:
"failed to parse embedded CMap." How can I solve this issue so that I can search text in all PDFs? Any suggestion will be great. Thanks in advance.
In general, it is impossible to search for text in all PDFs. There are two main reasons:
PDFs can use character codes that do not correspond to Unicode. In that case a CMap associates the PDF character codes with Unicode values, but a PDF document is not required to include one.
Even if a CMap is included, the characters of the text are not guaranteed to appear in reading order in the PDF document. PDF places the glyphs corresponding to character codes by geometry, not by text flow.
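To make the first point concrete, here is a toy illustration in Ruby (not Objective-C, and not a PDF parser). It treats a CMap as a plain lookup table from font-specific character codes to Unicode code points. The table contents and the helper name `decode_codes` are made up for the example.

```ruby
# Hypothetical CMap-like table: font character code => Unicode code point.
TO_UNICODE = { 0x01 => 0x48, 0x02 => 0x69 }  # maps to "H" and "i"

# Translate character codes to text; codes without a mapping cannot be
# searched, so they become U+FFFD (the replacement character).
def decode_codes(codes, cmap)
  codes.map { |code|
    code_point = cmap[code]
    code_point ? [code_point].pack("U") : "\uFFFD"
  }.join
end

puts decode_codes([0x01, 0x02], TO_UNICODE)  # every code has a mapping
puts decode_codes([0x01, 0x03], TO_UNICODE)  # 0x03 has no mapping
```

When the table is missing or cannot be parsed, every code ends up in the unmapped case, which is why a search has nothing to match against.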

Character conversion in ruby 1.8.7 from pdftk unicode conversion results

I am parsing titles from PDF files using pdftk, and the titles have various language-specific characters in them.
The Ruby on Rails application I need to do this in uses Ruby 1.8.7 and Rails 2.3.14, so any encoding solutions built into Ruby 1.9 aren't an option for me right now.
Example of what I need to do:
If the title includes a ü, when I read the PDF content using pdftk (either on the command line or using the Ruby pdf-toolkit gem) the "ü" gets converted to &#252;
In my application, I really want this as the rendered ü, as this seems to work fine for my needs in a web page and in an XML file.
I can convert the character explicitly in Ruby using
>> string = "&#252;"
=> "&#252;"
>> string.gsub("&#252;","ü")
=> "ü"
but obviously I don't want to do this one by one.
I've tried using Iconv to do this, but I don't know what to specify to get this converted to the rendered character. I thought maybe this was just UTF-8, but it doesn't convert to the rendered character:
>> Iconv.iconv("latin1", "utf-8","&#252;").join
=> "&#252;"
I am a little confused about which from/to encodings to use here to end up with the rendered character.
So how do I use Iconv or other tools to make this conversion for all characters that pdftk converts to this HTML code?
Or how do I tell pdftk to do this when I read the PDF file in the first place?
OK, I think the issue here is that the codes pdftk returns are HTML entities, so unescaping the HTML first is the path that works:
>> Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(string) ).join
=> "ü"
Update:
Using the following
pdf = PDF::Toolkit.open(file)
pdf.title = Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(pdf.title)).join
This seems to work for most languages, but when I apply it to Japanese and Chinese titles, it mangles them and the result doesn't match the original as it appears in the PDF.
Update:
Getting closer: it appears that the HTML codes pdftk puts in the title for Japanese and Chinese already render correctly if I just unescape them and don't attempt any Iconv conversion.
CGI.unescapeHTML(pdf.title)
This renders correctly.
So how do I test pdf.title ahead of time to see whether it is Chinese or Japanese (double byte?) before I apply the conversion needed for other languages?
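Maybe something like this, which decodes every numeric entity straight to UTF-8 (Integer#chr only handles values up to 255 on Ruby 1.8.7, so pack("U") is used instead):
string.gsub(/&#(\d+);/){ [$1.to_i].pack("U") }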
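A slightly fuller sketch of the same idea follows. It decodes the numeric references itself, so the same code path covers both ü and Chinese or Japanese titles, and no Iconv step is needed. The helper names `decode_numeric_entities` and `beyond_latin1?` are hypothetical, and the hexadecimal form is handled only for completeness; whether pdftk ever emits it is an assumption. Array#pack with the U directive emits UTF-8 bytes.

```ruby
# Decode decimal (&#252;) and hexadecimal (&#xFC;) character references
# into UTF-8 text. Values outside the Unicode range make pack raise a
# RangeError; that case is not handled in this sketch.
def decode_numeric_entities(text)
  text.gsub(/&#(x[0-9a-fA-F]+|\d+);/) do
    ref = $1
    code_point = ref.start_with?("x") ? ref[1..-1].to_i(16) : ref.to_i
    [code_point].pack("U")
  end
end

# Answers the "test ahead of time" question: true when any decimal
# reference lies outside Latin-1 (code point above 255).
def beyond_latin1?(text)
  text.scan(/&#(\d+);/).flatten.any? { |digits| digits.to_i > 255 }
end

title = "M&#252;ller &#20013;&#32654;"
puts decode_numeric_entities(title)
puts beyond_latin1?(title)
```

Named entities such as &amp; are left untouched here, so CGI.unescapeHTML may still be needed for those.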

lua reading chinese character

I have the following xml that I would like to read:
chinese xml - https://news.google.com/news/popular?ned=cn&topic=po&output=rss
korean xml - http://www.voanews.com/templates/Articles.rss?sectionPath=/korean/news
Currently, I use LuaXML to parse the XML, which contains Chinese characters. However, when I print the result to the console, the Chinese characters are not printed correctly and show up as garbage characters.
Is there any way to parse Chinese or Korean characters into a Lua table?
I don't think Lua is the issue here. The raw data the remote site sends is encoded using UTF-8, and Lua does no special interpretation of that—which means it should be preserved perfectly if you just (1) read from the remote site, and (2) save the read data to a file. The data in the file will contain CJK characters encoded in UTF-8, just like the remote site sent back.
If you're getting funny results like you mention, the fault probably lies either with the library you're using to read from the remote site, or perhaps simply with the way your console displays the results when you output to it.
I managed to convert "&#20013;&#32654;" into Chinese characters.
It takes one additional step: before saving in XML format, I have to convert every such string using the method from this link, http://forum.luahub.com/index.php?topic=3617.msg8595#msg8595
string.gsub(l,"&#([0-9]+);", function(c) return string.char(tonumber(c)) end)
I would also like to ask about LuaXML. I have come across the method xml.registerCode(decoded,encoded)
For that method, the documentation says that it
registers a custom code for the conversion between non-standard characters and XML character entities
What do they mean by non-standard characters, and how do I use it?

Problem with cyrillic characters in Ruby on Rails

In my Rails app I work a lot with Cyrillic characters. That's no problem: I store them in the DB and I can display them in HTML.
But I have a problem exporting them to a plain txt file. A string like "элиас" becomes "—ç–ª–∏–∞—Å" if I let Rails put it in a txt file and download it. What's wrong here? What has to be done?
Regards,
Elias
Obviously, there's a problem with your encoding. Make sure your text is in Unicode before writing it to the text file. You may use something like this:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
your_unicode_text = ic.iconv(your_text + ' ')[0..-2]
Also, double check that your database encoding is UTF-8. Cyrillic characters can display fine in DB and in html with non-unicode encoding, e.g. KOI8-RU, but you're guaranteed to have problems with them elsewhere.
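It can also help to check which side is misreading the bytes. The garbage in the question looks consistent with correct UTF-8 bytes being opened as MacRoman, which would mean the file itself may be fine and the viewer is guessing the wrong encoding. The sketch below reproduces that kind of mojibake and reverses it. It uses Ruby 1.9+ String encoding methods, and the BOM prefix at the end is just one workaround that some editors honor, not a guaranteed fix.

```ruby
original = "элиас"  # UTF-8 text as stored in the app

# Reinterpret the same bytes as MacRoman, then transcode for display.
# This prints mojibake of the kind shown in the question.
garbled = original.dup.force_encoding("macRoman").encode("UTF-8")
puts garbled

# Reversing the misreading recovers the text, so no data was lost.
restored = garbled.encode("macRoman").force_encoding("UTF-8")
puts restored == original

# Possible workaround: prefix a UTF-8 byte order mark so that editors
# which look for one detect the encoding.
BOM = "\uFEFF"
export = BOM + original
```

In Rails, declaring the charset when sending the file, for example a :type of "text/plain; charset=utf-8" passed to send_data, is another thing worth trying.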
