Rails 3 dealing with special characters - ruby-on-rails

I want to let users fill in an input field with special characters (e.g. ¥ and others).
The input may be saved to an XML file and later fetched and rendered back into the form input.
What is the best practice for saving special symbols to XML (maybe using HTML entities or hexadecimal form)?
Thanks in advance.

I'd say that if you save the file as UTF-8 you will have no problems.
If some controller or view has encoding problems, put this magic comment on the first line of that file:
# encoding: utf-8

There's nothing special about them, and you don't need to encode them. Let your XML library deal with that. XML has supported Unicode from the start, and what you call "special symbols" are just Unicode characters.
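As a rough sketch of what "let your XML library deal with that" can look like, here is a round trip through REXML, the XML library bundled with Ruby. The element names form and field and the helpers build_xml and read_field are made up for illustration; they are not from the question.

```ruby
require 'rexml/document'

# Hypothetical helper: wrap a user-supplied value in a tiny XML document.
def build_xml(value)
  doc = REXML::Document.new
  doc << REXML::XMLDecl.new('1.0', 'UTF-8')
  doc.add_element('form').add_element('field').text = value
  doc.to_s
end

# Hypothetical helper: read the value back out of the serialized document.
def read_field(xml)
  REXML::Document.new(xml).root.elements['field'].text
end

input = "\u00A5 1000"           # the yen sign, written as an escape to keep this file ASCII
xml = build_xml(input)
puts xml                        # the yen sign is stored as a plain UTF-8 character
puts read_field(xml) == input   # true if the round trip preserved the input
```

No entity or hexadecimal form is involved: the character is written as-is and comes back unchanged.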

Related

Does HTML Encoding have any cons?

I develop a project on ASP.NET MVC framework. All files and charsets are in UTF-8. I'm using model bindings and in some of my models the display property includes some accented chars or single/double quotes.
Because the Razor engine automatically encodes helper output (e.g. DisplayNameFor), the accented chars and quotes are encoded.
I could use custom helpers to render without encoding, but I would like to know whether HTML encoding has any cons. I'm using UTF-8 and I want to render the text "Öger's tours" as it is. However, it is rendered as "&#214;ger&#39;s tours". I'm asking about this scenario.
(I've heard that search engine indexing performs better without encoded text. But I don't know why.)
Thank you.
The only characters that must be encoded as entities are <, which starts the opening and closing tags of HTML elements; &, which otherwise starts an HTML entity; and (within attribute values enclosed in double quotes) ", to prevent terminating the attribute prematurely. It is also a good idea to use the entity for > to avoid confusing parsers.
For everything else it is enough to declare the proper charset and actually save the HTML file in it. In particular, there is no need to encode ' outside attribute values enclosed in single quotes, nor umlauts, ligatures or other non-ASCII characters, if the HTML file's charset supports them.
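The question is about Razor, but the same rule can be illustrated in Ruby with ERB::Util.html_escape from the standard library. This is only a sketch; escape_for_html is a hypothetical wrapper name. It escapes the structurally significant characters (&, <, >, " and ') and leaves accented characters untouched.

```ruby
require 'erb'

# Hypothetical wrapper: escape only the characters that matter to an HTML parser.
# ERB::Util.html_escape handles & < > " and ' and passes other text through.
def escape_for_html(text)
  ERB::Util.html_escape(text)
end

# "\u00D6" is the letter O with diaeresis, as in the question's sample text.
puts escape_for_html("\u00D6ger's <b>tours</b> & more")
```

The accented letter stays a plain UTF-8 character in the output, which is all a page served as UTF-8 needs.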
I found a solution: use the AntiXSS library as the Razor encoderType. This answer describes it well: Special characters in html output
The default Razor encoder encodes accented chars whereas the AntiXSS library does not encode them. So, accented chars are rendered as they are.

Character encoding, how do I tell the difference?

Characters coming out of my database are encoded differently than the same characters written directly in the source. For example, the word Permissões gives a different result when the string is written directly in the source than when it is output from a db record.
# From the source
Addressable::URI.encode("Permissões.pdf") #=> "Permiss%C3%B5es.pdf"
# From the db
Addressable::URI.encode("Permissões.pdf") #=> "Permisso%CC%83es.pdf"
The encodings are different. But my database is set to UTF-8, and I am using HTML5. What could be causing this?
I am unable to download files I upload to S3 because of this issue. I tried to force the encoding with attachment.path.encode("UTF-8"), but that makes no difference.
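Both strings are valid UTF-8, but they use different Unicode normalization forms: the source string contains the precomposed õ (%C3%B5), while the database string contains a plain o followed by a combining tilde (%CC%83). To solve this, since I am using Rails, I used ActiveSupport::Multibyte::Unicode to normalize any Unicode characters before they are inserted into the database.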
before_save do
  self.path = ActiveSupport::Multibyte::Unicode.normalize(path)
end
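To see the difference outside Rails, here is a minimal sketch in plain Ruby. It assumes String#unicode_normalize (part of Ruby since 2.2) in place of the ActiveSupport call, and uses ERB::Util.url_encode only to make the bytes visible; normalize_path is a hypothetical helper name.

```ruby
require 'erb'

# Hypothetical helper mirroring the before_save hook: compose to NFC.
def normalize_path(path)
  path.unicode_normalize(:nfc)
end

composed   = "Permiss\u00F5es.pdf"    # o-tilde as a single code point (NFC)
decomposed = "Permisso\u0303es.pdf"   # plain "o" followed by a combining tilde (NFD)

puts composed == decomposed                 #=> false, the bytes differ
puts ERB::Util.url_encode(composed)         #=> Permiss%C3%B5es.pdf
puts ERB::Util.url_encode(decomposed)       #=> Permisso%CC%83es.pdf
puts normalize_path(decomposed) == composed #=> true
```

Both strings look identical on screen, which is why forcing the encoding to UTF-8 changes nothing: the encoding was never the problem, the normalization form was.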

Isn't user data that comes in from a form in Rails going to be UTF-8 encoded?

A Rails 3.2 app I'm contributing to has a method that coerces user input to UTF-8.
require "iconv"
def normalize(user_input_text)
  Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(user_input_text.dup)
end
It basically re-encodes the string from UTF-8 to UTF-8 and drops any byte sequences that can't be transcoded.
But isn't all user data that's entering Rails through a form going to be UTF-8 encoded?
In other words, isn't this code superfluous?
These resources suggest that you are indeed right.
Now that the vast majority of web input is UTF-8, we set
the inbound parameters to UTF-8. This will eliminate many
cases of incompatible encodings between ASCII-8BIT and
UTF-8.
https://github.com/rails/rails/commit/25215d7285db10e2c04d903f251b791342e4dd6a
Rails 3 solves this very nicely by doing a number of things including interpreting params as UTF-8 and adding workarounds for Internet Explorer
http://jasoncodes.com/posts/ruby19-rails2-encodings
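If a defensive cleanup like the one in the question is still wanted, note that Iconv was removed from the standard library in Ruby 2.0. A rough equivalent can be sketched with String#scrub (Ruby 2.1 and later). The method name normalize_utf8 is hypothetical, and dropping invalid bytes silently is an assumption about the desired behavior, not something the linked resources prescribe.

```ruby
# Hypothetical replacement for the Iconv-based normalize method:
# tag the bytes as UTF-8, then drop any byte sequences that are not valid UTF-8.
def normalize_utf8(user_input_text)
  user_input_text.dup.force_encoding(Encoding::UTF_8).scrub('')
end

raw = "caf\xC3\xA9 \xFF ok".b   # binary string with one invalid byte (0xFF)
clean = normalize_utf8(raw)
puts clean.valid_encoding?      # the result is now well-formed UTF-8
puts clean
```

Valid multibyte sequences (the two bytes of the accented e here) are kept, and only the stray 0xFF byte is removed.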

Lua reading Chinese characters

I have the following XML feeds that I would like to read:
Chinese XML - https://news.google.com/news/popular?ned=cn&topic=po&output=rss
Korean XML - http://www.voanews.com/templates/Articles.rss?sectionPath=/korean/news
Currently I use LuaXML to parse the XML, which contains Chinese characters. However, when I print the result to the console, the Chinese characters are not displayed correctly and show up as garbage.
Is there any way to parse Chinese or Korean characters into a Lua table?
I don't think Lua is the issue here. The raw data the remote site sends is encoded using UTF-8, and Lua does no special interpretation of that—which means it should be preserved perfectly if you just (1) read from the remote site, and (2) save the read data to a file. The data in the file will contain CJK characters encoded in UTF-8, just like the remote site sent back.
If you're getting funny results like you mention, the fault probably lies either with the library you're using to read from the remote site, or perhaps simply with the way your console displays the results when you output to it.
I managed to convert "&#20013;&#32654;" into Chinese characters.
It takes one additional step: before saving in XML format, convert every such sequence in the string using the method from this link, http://forum.luahub.com/index.php?topic=3617.msg8595#msg8595
-- string.char only accepts values 0-255; utf8.char (Lua 5.3+) handles code points such as 20013
string.gsub(l, "&#([0-9]+);", function(c) return utf8.char(tonumber(c)) end)
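For comparison, the same decoding step can be sketched in Ruby rather than Lua. decode_numeric_entities is a hypothetical helper name; it handles only decimal references of the form &#NNNN; and ignores hexadecimal and named entities.

```ruby
# Hypothetical helper: turn decimal character references such as "&#20013;"
# into the characters they stand for, encoded as UTF-8.
def decode_numeric_entities(text)
  text.gsub(/&#(\d+);/) { [Regexp.last_match(1).to_i].pack('U') }
end

puts decode_numeric_entities('&#20013;&#32654;')   # prints the two Chinese characters
```

The important part is that each number is treated as a Unicode code point and emitted as a multi-byte UTF-8 sequence, not as a single byte.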
I would also like to ask about LuaXML: I have come across the method xml.registerCode(decoded,encoded).
Its documentation says that it
registers a custom code for the conversion between non-standard characters and XML character entities
What do they mean by non-standard characters, and how do I use it?

How to disable UTF character (punctuation) escaping when creating XML using default to_xml with Rails?

Given a Rails model column that contains
"Something & Something Else", when outputting with to_xml
Rails will escape the ampersand like so:
<MyElement>Something &amp; Something Else</MyElement>
Our client software is all UTF aware and it would be better if we could just leave the column content raw in our XML output.
There was an old solution that worked by setting $KCODE="UTF8" in an environment file, but this trick no longer works, and was always an all-or-nothing solution.
Any recommendations on how to disable this, ideally on a case-by-case basis?
It does not matter if the client software is UTF-8-aware. An ampersand cannot be used unescaped in XML. If the software is supposed to also be XML-aware, then any content that includes ampersands is not allowed to be kept "raw".
This has nothing to do with Unicode (or "UTF"). Ampersands in XML must be escaped, otherwise it isn't XML, and no XML software will accept it. If you're saying you want the escaping disabled, then you're saying you don't want the output to be XML.
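A small sketch of why the escaping is harmless for an XML-aware client: the entity exists only in the serialized text, and a parser hands the raw ampersand back. This uses Ruby's String#encode with the xml: :text option and REXML rather than Rails' to_xml; xml_text and parsed_text are hypothetical helper names.

```ruby
require 'rexml/document'

# Hypothetical helper: escape text for use as XML element content.
def xml_text(value)
  value.encode(xml: :text)   # replaces &, < and > with entities
end

# Hypothetical helper: what an XML-aware client sees after parsing.
def parsed_text(xml)
  REXML::Document.new(xml).root.text
end

xml = "<MyElement>#{xml_text('Something & Something Else')}</MyElement>"
puts xml                # the serialized form carries &amp;
puts parsed_text(xml)   # the parsed value has the plain ampersand again
```

So the column content already reaches the client "raw" once the client parses the document; nothing needs to be disabled.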
