Rails: need advice about slug, URL, and CJK characters - ruby-on-rails

I'm currently building a multi-language application which will support at least English and Japanese.
The application must be able to have URIs such as domain.com/username-slug. While this works fine with Latin characters, it does not work so well (or rather, it looks ugly) with Japanese characters: domain.com/三浦パン屋
I was thinking of using a random number when the username is Japanese, such as:
def generate_token
  self.slug = loop do
    random_token = SecureRandom.uuid.gsub("-", "").hex.to_s[0..8]
    break random_token unless self.class.exists?(slug: random_token)
  end
end
But I don't know if this is such a good idea. I am looking for advice from people who have already faced this issue/case. Thoughts?
Thanks

TL;DR summary:
Use UTF-8 everywhere
For URIs, percent-escape all characters except the few permitted in URLs
Encourage your customers to use browsers which behave well with UTF-8 URLs
Here's the more detailed explanation. What you are after is a system of URLs for your website which have five properties:
When displayed in the location bar of your user's browser, the URLs are legible to the user and in the user's preferred language.
When the user types or pastes the legible text in their preferred language into the location bar of their browser, the browser forms a URL which your site's HTTP server can interpret correctly.
When displayed in a web page, the URLs are legible to the user and in the user's preferred language.
When supplied as the target of an HTML link, the link forms a URL which the user's web browser can correctly send to your site, and which your site's HTTP server can interpret correctly.
When your site's HTTP server receives these URLs, it passes the URL to your application in a way the application can interpret correctly.
RFC 3986 URI Generic Syntax, section 2 Characters says,
This specification does not mandate any particular character encoding
for mapping between URI characters and the octets used to store or
transmit those characters.... A percent-encoding mechanism is used to
represent a data octet in a component when that octet's corresponding
character is outside the allowed set or is being used as a
delimiter...
The URIs in question are http:// URIs, however, so the HTTP spec also governs. RFC 2616 HTTP/1.1, Section 3.4 Character Sets, says that the encoding (there named 'character set', for consistency with the MIME spec) is specified using MIME's charset tags.
What that boils down to is that the URIs can be in a wide variety of encodings, but you are responsible for making sure your web site code and your HTTP server agree on the encoding you will use. The HTTP protocol treats the URIs largely as opaque octet streams. In practice, UTF-8 is a good choice. It covers the entire Unicode character repertoire, it's an octet-based encoding, and it's widely supported. Percent-encoding is straightforward to add and remove, for instance with Ruby's URI::Escape module (URI.escape / URI.unescape).
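For example, here is a minimal sketch using Ruby's standard library (CGI.escape / CGI.unescape here, since URI.escape was deprecated and then removed in Ruby 3.0):

require "cgi"

escaped = CGI.escape("三浦パン屋")   # => "%E4%B8%89%E6%B5%A6%E3%83%91%E3%83%B3%E5%B1%8B"
CGI.unescape(escaped)                # => "三浦パン屋"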
Let's turn next to the browser. You should find out with what browsers your users are visiting your site. Test the URL handling of these browsers by pasting in URLs with Japanese language path elements, and seeing what URL your web server presents to your Ruby code. My main browser, Firefox 16.0.2 on Mac OS X, interprets characters pasted into its location bar as UTF-8, and uses that encoding plus percent-escaping when passing the URL to an HTTP request. Similarly, when it encounters a URL for an HTTP page which has non-latin characters, it removes the percent encoding of the URL and treats the resulting octets as if they were UTF-8 encoded. If the browsers your users favour behave the same way, then UTF-8 URLs will appear in Japanese to your users.
Do your customers insist on using browsers that don't behave well with percent-encoded URLs and UTF-8 encoded URL parts? Then you have a problem. You might be able to figure out some other encoding which the browsers do work well with, say Shift-JIS, and make your pages and web server respect that encoding instead. Or, you might try encouraging your users to switch to a browser which supports UTF-8 well.
Next, let's look at your site's web pages. Your code has control over the encoding of the web pages. Links in your pages will have link text, which of course can be in Japanese, and a link target, which must be in some encoding comprehensible to your web server. UTF-8 is a good choice for the web page's encoding.
So, you don't absolutely have to use UTF-8 everywhere. The important thing is that you pick one encoding that works well in all three parts of your ecosystem: the customers' web browsers, your HTTP server, and your web site code. Your customers control one part of this ecosystem. You control the other two.
Encode your URL-paths ("username-slugs") in this encoding, then percent-escape those URLs. Author and code your pages to use this encoding. The user experience should then satisfy the five requirements above. And I predict that UTF-8 is likely to be a good encoding choice.
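As a concrete illustration of that advice (this is not the original poster's code), here is a sketch assuming a hypothetical User model with name and slug columns, as an alternative to the random-token idea in the question:

require "erb"

class User < ApplicationRecord
  # Keep the Japanese username as the slug, stored as plain UTF-8.
  before_validation(on: :create) { self.slug ||= name }   # e.g. "三浦パン屋"
end

# When building a link, percent-escape the slug; a browser that handles
# UTF-8 URLs well will show it in Japanese again in the location bar:
#   "/#{ERB::Util.url_encode(user.slug)}"
#   # => "/%E4%B8%89%E6%B5%A6%E3%83%91%E3%83%B3%E5%B1%8B"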

Related

Why does IE11 transform my URL with parameters?

I have an IntraWeb application that I call with the following URL, containing one parameter.
http://127.0.0.1:8888/?0001=„V‡&
When I enter the URL in IE11, it is transformed to the following.
http://127.0.0.1:8888/$/?0001=%EF%BF%BD%EF%BF%BDV%EF%BF%BD
It works in Chrome and Opera, which pass the parameter in its original form to the application. Any ideas how I can stop IE11 from transforming the parameter?
I'm using IE11, Delphi 10.1 Berlin and Intraweb 14.0.53.
Non-ASCII characters like „ and ‡ are not allowed to appear un-encoded in a URL, per RFC 3986 (an IRI via RFC 3987 allows it, but HTTP does not use IRIs). Certain ASCII characters are also reserved as delimiters and must not appear un-encoded when used for character data.
IE is doing the correct thing. Before transmitting the URL, restricted characters must be charset-encoded to bytes (IE is using UTF-8, there is no way for a URL to specify the charset used) and then the bytes must be percent-encoded in %HH format.
The webserver is expected to reverse this process to obtain the original characters - convert %HH sequences to bytes, and then charset-decode the bytes to characters. The browser and web server must agree on the charset used. UTF-8 is usually used, but not always; some servers in other locales might use their own local encodings instead.
This is part of the URL and HTTP specifications. All web servers are required to recognize and decode percent-encoded sequences.
Chrome and Opera should be doing the same thing at the networking layer that IE is doing, even if they are not updating their UIs to reflect it. You can use a packet sniffer like Wireshark, or an HTTP debugging proxy like Fiddler, to verify exactly what each browser is actually transmitting to the web server.
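To illustrate the encode/decode round trip described above (the original application is Delphi/IntraWeb; Ruby's standard library is used here purely for illustration):

require "uri"

query = URI.encode_www_form("0001" => "„V‡")   # charset-encode to UTF-8 bytes, then %HH-escape
# => "0001=%E2%80%9EV%E2%80%A1"

URI.decode_www_form(query).to_h                # the server reverses both steps
# => {"0001"=>"„V‡"}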
Encoding the query-string seems to be the only answer according to this post.
Internet Explorer having problems with special chars in querystrings

National characters in URI

How should authors publish (by URI) resources with names containing non-ASCII (for example national) characters?
Considering all the various parties (HTML code, the browser sending the request, the browser saving the file to disk, the server receiving and processing the request, and the server storing the file) all (possibly) using various encodings, it seems nearly impossible to get it working consistently. Or at least I never managed to.
When it comes to web pages I’m already used to that and always replace national characters with corresponding Latin base characters.
But when it comes to external files (PDFs, images, …) it somehow “feels wrong” to “downgrade” the names. Especially if one expects users to save those files on disk. How to deal with this issue?

ExactTarget e-mails inconsistently rendering Korean or Chinese characters

I have a small ASP.Net MVC site that is scraping content from a client site for inclusion in an ExactTarget-generated e-mail. If a content area uses hard coded Chinese or Korean characters, the e-mails render properly on all clients. When an area calls to the MVC site, using
%%before; httpget; 1 "http://mysite/contentarea/?parm1=One&parm2=Two"%%
the resulting html being sent out doesn't render consistently on all clients. GMail handles it ok, but Yahoo and Hotmail do not. The resulting characters make it look like an encoding issue. I have the MVC site spitting out utf-8 à la
Response.ContentEncoding = System.Text.Encoding.UTF8;
This is the first time I've really had to play with the encoding, so that may be part of my problem. :-)
I've looked at the wiki at http://wiki.memberlandingpages.com/ but it has not been much help. What I'd like to do is define in the AMPscript that the incoming stream from the MVC site is encoded as UTF-8 (or whatever). I'm assuming having things explicitly laid out should address this, but I don't know if there's something about Hotmail or Yahoo that needs to be managed somehow as well. Thanks for any help!
The only way (that I've found) to set the character encoding in Exacttarget is to request that they turn on internationalization settings for your account. Email them to let them know that you need that turned on and they should be able to sort you out soon.
Here's some documentation on it:
http://wiki.memberlandingpages.com/010_ExactTarget/020_Content/International_Sends?highlight=internationalization
You'll then have a drop-down when creating an email to specify the character encoding. I banged my head against the wall on this one for a long time before finding that documentation page. Hope that helps!
I know this is old, but just in case people are still hunting: I think that, beyond the internationalization settings in ExactTarget, the httpget function will default to WindowsCodePage 1252 encoding unless the source page's headers specify UTF-8.
http://wiki.memberlandingpages.com/010_ExactTarget/020_Content/AMPscript/AMPscript_Syntax_Guide/HTTP_AMPscript_Functions
"NOTE: ExactTarget honors any character set returned in the HTTP headers via Content-Type. For example, you can use a UTF-8 encoded HTML file with Content-Type: text/html; charset=utf-8 included in the header. If the encoding is not specified in the header, the application assumes all returned data will be in the character set WindowsCodePage 1252. You can change this default by contacting Global Support."

Web Hosting URL Length Limit?

I am designing a web application which is a tie-in to my iPhone application. It sends massively large URLs to the web server (about 15,000 characters). I was using NearlyFreeSpeech.net, but they only support URLs up to 2,000 characters. I was wondering if anybody knows of web hosting that will support really large URLs? Thanks, Isaac
Edit: My program needs to open a picture in Safari. I could do this 2 ways:
send it base64 encoded in the URL and just echo the query parameters.
first POST it to the server from my application; the server would store the photo in a database and send back a unique ID, which I would append to a URL; opening that URL in Safari would retrieve the photo from the database and then delete it.
You see, I am lazy, and I know Mobile Safari can support URIs up to 80,000 characters, so I think this is an OK way to do it. If there is something really wrong with this, please tell me.
Edit: I ended up doing it the proper POST way. Thanks.
If you're sending 15,000-character-long URLs, in all likelihood you're doing it wrong.
Use something like an HTTP POST instead.
The limitations you're running up against aren't so much an issue with the hosts - it's more the fact that web servers have a limit for the length of a URL. According to this page, Apache limits you to around 4k characters, and IIS limits you to 16k by default.
Although it's not directly answering your question, and there is no official maximum length of a URL, browsers and servers have practical limits - see http://www.boutell.com/newfaq/misc/urllength.html for some details. In short, since IE (at least some versions in use) doesn't support URLs over 2,083 characters, it's probably wise to stay below that length.
If you need to just open it in Safari, and the server doesn't need to be involved, why not use a data: URI?
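For instance, a data: URI embeds the image itself, so no server is involved at all. A quick sketch (Ruby here purely for illustration; the poster's app is an iPhone app, and the file name is made up):

require "base64"

png = File.binread("photo.png")   # hypothetical local image file
uri = "data:image/png;base64,#{Base64.strict_encode64(png)}"
# Open `uri` in Safari; the whole image travels inside the URI itself.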
Sending long URIs over the network is basically never the right thing to do. As you noticed, some web hosts don't support long URIs. Some proxy servers may also choke on long URLs, which means that your app might not work for users who are behind those proxies. If you ever need to port your app to a different browser, other browsers may not support URIs that long.
If you need to get data up to a server, use a POST. Yes, it's an extra round trip, but it will be much more reliable.
Also, if you are uploading data to the server using a GET request, then you are vulnerable to all kinds of cross-site request forgery attacks; basically, an attacker can trick the user into uploading, say, goatse to their account simply by getting them to click on a link (perhaps hidden by TinyURL or another URL shortening service, or just embedded as a link in a web page when they don't look closely at the URL they're clicking on).
You should never use GET for sending data to the server, beyond query parameters that don't actually change anything on the server.
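A sketch of the POST-then-GET flow the answers recommend, using Ruby's Net::HTTP (the endpoint and the server's response format are hypothetical):

require "net/http"
require "uri"

uri = URI("https://example.com/photos")
res = Net::HTTP.post(uri, File.binread("photo.png"),
                     "Content-Type" => "image/png")
photo_id = res.body   # assume the server replies with a short unique ID
# Then open the short URL in Safari, e.g. "https://example.com/photos/#{photo_id}"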

UTF-8 characters mangled in HTTP Basic Auth username

I'm trying to build a web service using Ruby on Rails. Users authenticate themselves via HTTP Basic Auth. I want to allow any valid UTF-8 characters in usernames and passwords.
The problem is that the browser is mangling characters in the Basic Auth credentials before it sends them to my service. For testing, I'm using 'カタカナカタカナカタカナカタカナカタカナカタカナカタカナカタカナ' as my username (no idea what it means - AFAIK it's some random characters our QA guy came up with - please forgive me if it is somehow offensive).
If I take that as a string and do username.unpack("h*") to convert it to hex, I get: '3e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a8' That seems about right for 32 kanji characters (3 bytes/6 hex digits per).
If I do the same with the username that's coming in via HTTP Basic auth, I get:
'bafbbaacbafbbaacbafbbaacbafbbaacbafbbaacbafbbaacbafbbaacbafbbaac'. It's obviously much shorter. Using the Firefox Live HTTP Headers plugin, here's the actual header that's being sent:
Authorization: Basic q7+ryqu/q8qrv6vKq7+ryqu/q8qrv6vKq7+ryqu/q8o6q7+ryqu/q8qrv6vKq7+ryqu/q8qrv6vKq7+ryqu/q8o=
That looks like that 'bafbba...' string, with the high and low nibbles swapped (at least when I paste it into Emacs, base 64 decode, then switch to hexl mode). That might be a UTF16 representation of the username, but I haven't gotten anything to display it as anything but gibberish.
Rails is setting the content-type header to UTF-8, so the browser should be sending in that encoding. I get the correct data for form submissions.
The problem happens in both Firefox 3.0.8 and IE 7.
So... is there some magic sauce for getting web browsers to send UTF-8 characters via HTTP Basic Auth? Am I handling things wrong on the receiving end? Does HTTP Basic Auth just not work with non-ASCII characters?
I want to allow any valid UTF-8 characters in usernames and passwords.
Abandon all hope. Basic Authentication and Unicode don't mix.
There is no standard(*) for how to encode non-ASCII characters into a Basic Authentication username:password token before base64ing it. Consequently every browser does something different:
Opera uses UTF-8;
IE uses the system's default codepage (which you have no way of knowing, other than it's never UTF-8), and silently mangles characters that don't fit into it using the Windows ‘guess a random character that looks a bit like the one you wanted or maybe just not’ secret recipe;
Mozilla uses only the lower byte of character codepoints, which has the effect of encoding to ISO-8859-1 and mangling the non-8859-1 characters irretrievably... except when doing XMLHttpRequests, in which case it uses UTF-8;
Safari and Chrome encode to ISO-8859-1, and fail to send the authorization header at all when a non-8859-1 character is used.
*: some people interpret the standard to say that either:
it should always be ISO-8859-1, due to that being the default encoding for raw 8-bit characters included directly in headers;
it should be encoded using RFC2047 rules, somehow.
But neither of these proposals really applies to the contents of a base64-encoded auth token, and the RFC2047 reference in the HTTP spec really doesn't work at all, since all the places it might potentially be used are explicitly disallowed by the ‘atom context’ rules of RFC2047 itself, even if HTTP headers honoured the rules and extensions of the RFC822 family, which they don't.
In summary: ugh. There is little-to-no hope of this ever being fixed in the standard or in the browsers other than Opera. It's just one more factor driving people away from HTTP Basic Authentication in favour of non-standard and less-accessible cookie-based authentication schemes. Shame really.
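To make the difference concrete, here is a sketch of what goes into the Authorization header (the credentials are made up). The value is just the base64 of "username:password", so everything hinges on which charset the browser uses before base64ing:

require "base64"

user, pass = "カタカナ", "secret"

Base64.strict_encode64("#{user}:#{pass}".encode("UTF-8"))
# => "44Kr44K/44Kr44OKOnNlY3JldA=="   (what a UTF-8 browser like Opera would send)

"#{user}:#{pass}".encode("ISO-8859-1")
# => raises Encoding::UndefinedConversionError -- there is no ISO-8859-1
#    representation at all, which is why Safari/Chrome drop the header entirely.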
It's a known shortcoming that Basic authentication does not provide support for non-ISO-8859-1 characters.
Some UAs are known to use UTF-8 instead (Opera comes to mind), but there's no interoperability for that either.
As far as I can tell, there's no way to fix this, except by defining a new authentication scheme that handles all of Unicode. And getting it deployed.
HTTP Digest authentication is no solution for this problem, either. It suffers from the same problem of the client being unable to tell the server what character set it's using and the server being unable to correctly assume what the client used.
Have you tested using something like curl to make sure it's not a Firefox issue? The HTTP Auth RFC is silent on ASCII vs. non-ASCII, but it does say the value passed in the header is the username and the password separated by a colon, and I can't find a colon in the string that Firefox is reporting sending.
If you are coding for Windows 8.1, note that the sample in the documentation for HttpCredentialsHeaderValue is (wrongly) using UTF-16 encoding. A reasonably good fix is to switch to UTF-8 (as ISO-8859-1 is not supported by CryptographicBuffer.ConvertStringToBinary).
See http://msdn.microsoft.com/en-us/library/windows/apps/windows.web.http.headers.httpcredentialsheadervalue.aspx.
Here is a workaround we used today to circumvent the issue of non-ASCII characters in a colleague's password:
curl -u "USERNAME:`echo -n 'PASSWORT' | iconv -f ISO-8859-1 -t UTF-8`" 'URL'
Replace USERNAME, PASSWORD and URL with your values. This example uses shell command substitution to transform the password character encoding to UTF-8 before executing the curl command.
Note: I used backtick ` ... ` command substitution here instead of $( ... ) because it doesn't fail if the password contains a ! character... [shells love ! characters ;-)]
Illustration of what happens with non-ASCII characters:
echo -n 'zz<zz§zz$zz-zzäzzözzüzzßzz' | iconv -f ISO-8859-1 -t UTF-8
I might be totally ignorant, but I came to this post while looking into a problem with sending a UTF-8 string as a header inside an ajax call.
I could solve my problem by Base64-encoding the string right before sending it. That means that, with some simple JS, you could convert the form to Base64 right before submitting, and that way it can be converted back on the server side.
This simple tool allowed me to send UTF-8 strings as plain ASCII. I found it thanks to this simple sentence:
base64 (this encoding is designed to make binary data survive transport through transport layers that are not 8-bit clean). http://www.webtoolkit.info/javascript-base64.html
I hope this helps somehow. Just trying to give back a little bit to the community!
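For completeness, the receiving side of this workaround in the original poster's Rails context might look like the following sketch (the X-Username-B64 header name is made up for illustration):

require "base64"

raw  = request.headers["X-Username-B64"]   # e.g. "44Kr44K/44Kr44OK"
name = Base64.decode64(raw).force_encoding(Encoding::UTF_8)
# => "カタカナ", intact regardless of how the browser mangles Basic Auth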
