UTF-8 characters mangled in HTTP Basic Auth username - ruby-on-rails

I'm trying to build a web service using Ruby on Rails. Users authenticate themselves via HTTP Basic Auth. I want to allow any valid UTF-8 characters in usernames and passwords.
The problem is that the browser is mangling characters in the Basic Auth credentials before it sends them to my service. For testing, I'm using 'カタカナカタカナカタカナカタカナカタカナカタカナカタカナカタカナ' as my username (no idea what it means - AFAIK it's some random characters our QA guy came up with - please forgive me if it is somehow offensive).
If I take that as a string and do username.unpack("h*") to convert it to hex, I get: '3e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a83e28ba3e28fb3e28ba3e38a8' That seems about right for 32 katakana characters (3 bytes / 6 hex digits each).
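(Side note: lowercase "h*" unpacks the low nibble of each byte first, so that dump is nibble-swapped relative to the actual byte order:)

"カ".unpack("H*")   # => ["e382ab"]  the actual UTF-8 bytes of カ (U+30AB)
"カ".unpack("h*")   # => ["3e28ba"]  nibble-swapped, matching the dump above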
If I do the same with the username that's coming in via HTTP Basic auth, I get:
'bafbbaacbafbbaacbafbbaacbafbbaacbafbbaacbafbbaacbafbbaacbafbbaac'. It's obviously much shorter. Using the Firefox Live HTTP Headers plugin, here's the actual header that's being sent:
Authorization: Basic q7+ryqu/q8qrv6vKq7+ryqu/q8qrv6vKq7+ryqu/q8o6q7+ryqu/q8qrv6vKq7+ryqu/q8qrv6vKq7+ryqu/q8o=
That looks like that 'bafbba...' string, with the high and low nibbles swapped (at least when I paste it into Emacs, base64 decode, then switch to hexl mode). That might be a UTF-16 representation of the username, but I haven't gotten anything to display it as anything but gibberish.
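For reference, the same inspection can be done in Ruby; Base64.decode64 hands back the raw bytes untouched:

require 'base64'
token = "q7+ryqu/q8qrv6vKq7+ryqu/q8o6q7+ryqu/q8qrv6vKq7+ryqu/q8qrv6vKq7+ryqu/q8o="
raw = Base64.decode64(token)     # the exact bytes the browser sent (ASCII-8BIT)
user, pass = raw.split(":", 2)   # Basic auth tokens are base64("username:password")
puts user.unpack("h*").first     # nibble-swapped hex, as elsewhere in this question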
Rails is setting the Content-Type header charset to UTF-8, so the browser should be sending in that encoding. I get the correct data for form submissions.
The problem happens in both Firefox 3.0.8 and IE 7.
So... is there some magic sauce for getting web browsers to send UTF-8 characters via HTTP Basic Auth? Am I handling things wrong on the receiving end? Does HTTP Basic Auth just not work with non-ASCII characters?

I want to allow any valid UTF-8 characters in usernames and passwords.
Abandon all hope. Basic Authentication and Unicode don't mix.
There is no standard(*) for how to encode non-ASCII characters into a Basic Authentication username:password token before base64ing it. Consequently every browser does something different:
Opera uses UTF-8;
IE uses the system's default codepage (which you have no way of knowing, other than it's never UTF-8), and silently mangles characters that don't fit into it using the Windows ‘guess a random character that looks a bit like the one you wanted or maybe just not’ secret recipe;
Mozilla uses only the lower byte of character codepoints, which has the effect of encoding to ISO-8859-1 and mangling the non-8859-1 characters irretrievably (see the sketch after this list)... except when doing XMLHttpRequests, in which case it uses UTF-8;
Safari and Chrome encode to ISO-8859-1, and fail to send the authorization header at all when a non-8859-1 character is used.
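That lower-byte truncation exactly reproduces the bytes in the question above; a hypothetical Ruby simulation:

# カタカナ is U+30AB U+30BF U+30AB U+30CA; keeping only the low byte of each
# codepoint gives AB BF AB CA, the repeating pattern in the decoded header.
"カタカナ".codepoints.map { |cp| cp & 0xFF }.pack("C*").unpack("H*")
# => ["abbfabca"]   ("bafbbaac" in the question's nibble-swapped dump)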
*: some people interpret the standard to say that either:
it should always be ISO-8859-1, that being the default encoding for raw 8-bit characters included directly in headers;
it should be encoded using RFC 2047 rules, somehow.
But neither proposal applies to the contents of a base64-encoded auth token, and the RFC 2047 reference in the HTTP spec really doesn't work at all, since all the places it might potentially be used are explicitly disallowed by the ‘atom context’ rules of RFC 2047 itself, even if HTTP headers honoured the rules and extensions of the RFC 822 family, which they don't.
In summary: ugh. There is little-to-no hope of this ever being fixed in the standard or in the browsers other than Opera. It's just one more factor driving people away from HTTP Basic Authentication in favour of non-standard and less-accessible cookie-based authentication schemes. Shame really.

It's a known shortcoming that Basic authentication does not provide support for non-ISO-8859-1 characters.
Some UAs are known to use UTF-8 instead (Opera comes to mind), but there's no interoperability for that either.
As far as I can tell, there's no way to fix this, except by defining a new authentication scheme that handles all of Unicode. And getting it deployed.

HTTP Digest authentication is no solution for this problem, either. It suffers from the same problem of the client being unable to tell the server what character set it's using and the server being unable to correctly assume what the client used.
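To illustrate: the Digest HA1 hash is computed over the raw bytes of "username:realm:password", so the same password in two encodings produces two different hashes (a minimal Ruby demonstration):

require 'digest/md5'
# "ü" is one byte (0xFC) in ISO-8859-1 but two (0xC3 0xBC) in UTF-8,
# so client and server disagree unless they happen to pick the same encoding.
Digest::MD5.hexdigest("user:realm:" + "ü".encode("ISO-8859-1")) ==
  Digest::MD5.hexdigest("user:realm:" + "ü".encode("UTF-8"))
# => false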

Have you tested using something like curl to make sure it's not a Firefox issue? The HTTP Auth RFC is silent on ASCII vs. non-ASCII, but it does say the value passed in the header is the username and the password separated by a colon, and I can't find a colon in the string that Firefox reports sending.
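One way to take the browser out of the equation from Ruby itself (a sketch; the endpoint and credentials are placeholders):

require 'net/http'
uri = URI("http://localhost:3000/protected")
req = Net::HTTP::Get.new(uri)
req.basic_auth("カタカナ", "secret")   # Net::HTTP base64s the UTF-8 bytes as-is
res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
puts res.code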

If you are coding for Windows 8.1, note that the sample in the documentation for HttpCredentialsHeaderValue (wrongly) uses UTF-16 encoding. A reasonably good fix is to switch to UTF-8 (since ISO-8859-1 is not supported by CryptographicBuffer.ConvertStringToBinary).
See http://msdn.microsoft.com/en-us/library/windows/apps/windows.web.http.headers.httpcredentialsheadervalue.aspx.

Here's a workaround we used today to circumvent the issue of non-ASCII characters in a colleague's password:
curl -u "USERNAME:`echo -n 'PASSWORD' | iconv -f ISO-8859-1 -t UTF-8`" 'URL'
Replace USERNAME, PASSWORD and URL with your values. This example uses shell command substitution to transform the password's character encoding to UTF-8 before executing the curl command.
Note: I used a ` ... ` evaluation here instead of $( ... ) because it doesn't fail if the password contains a ! character... [shells love ! characters ;-)]
Illustration of what happens with non-ASCII characters:
echo -n 'zz<zz§zz$zz-zzäzzözzüzzßzz' | iconv -f ISO-8859-1 -t UTF-8
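(For the record, the Ruby equivalent of that iconv step, assuming the raw input bytes are ISO-8859-1:)

raw = "zz\xE4zz\xF6zz\xFCzz".b                      # ä, ö, ü as single Latin-1 bytes
raw.force_encoding("ISO-8859-1").encode("UTF-8")    # => "zzäzzözzüzz"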

I might be totally ignorant, but I came to this post while looking into a problem with sending a UTF-8 string as a header inside an Ajax call.
I solved my problem by Base64-encoding the string right before sending it. That means you could, with some simple JS, convert the form to Base64 right before submitting, and that way it can be converted back on the server side.
This simple tool allowed me to have UTF-8 strings sent as plain ASCII. I found it thanks to this simple sentence:
base64 (this encoding is designed to make binary data survive transport through transport layers that are not 8-bit clean). http://www.webtoolkit.info/javascript-base64.html
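On the receiving side, decoding is one line of Ruby (a sketch; the parameter name username_b64 is hypothetical):

require 'base64'
# the client sent base64(UTF-8 bytes); decode, then re-tag the encoding
username = Base64.decode64(params[:username_b64]).force_encoding("UTF-8")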
I hope this helps somehow. Just trying to give back a little bit to the community!

Related

Why does IE11 transform my URL with parameters?

I have an IntraWeb application that I call with the following URL, containing one parameter.
http://127.0.0.1:8888/?0001=„V‡&
When I enter the URL in IE11, it is transformed to the following.
http://127.0.0.1:8888/$/?0001=%EF%BF%BD%EF%BF%BDV%EF%BF%BD
It works in Chrome and Opera, which pass the parameter in its original form to the application. Any ideas how I can stop IE11 from transforming the parameter?
I'm using IE11, Delphi 10.1 Berlin and Intraweb 14.0.53.
Non-ASCII characters like „ and ‡ are not allowed to appear un-encoded in a URL, per RFC 3986 (an IRI via RFC 3987 allows it, but HTTP does not use IRIs). Certain ASCII characters are also reserved as delimiters and must not appear un-encoded when used for character data.
IE is doing the correct thing. Before transmitting the URL, restricted characters must be charset-encoded to bytes (IE is using UTF-8, there is no way for a URL to specify the charset used) and then the bytes must be percent-encoded in %HH format.
The webserver is expected to reverse this process to obtain the original characters - convert %HH sequences to bytes, and then charset-decode the bytes to characters. The browser and web server must agree on the charset used. UTF-8 is usually used, but not always; some servers in other locales use their own regional encodings instead.
This is part of the URL and HTTP specifications. All web servers are required to recognize and decode percent-encoded sequences.
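The round trip is easy to see in Ruby, using the characters from the question:

require 'cgi'
CGI.escape("„V‡")                     # => "%E2%80%9EV%E2%80%A1"
CGI.unescape("%E2%80%9EV%E2%80%A1")   # => "„V‡"

(Note that the header in the question shows %EF%BF%BD, which is U+FFFD, the Unicode replacement character, suggesting the characters were already undecodable before IE percent-encoded them.)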
Chrome and Opera should be doing the same thing at the networking layer that IE is doing, even if they are not updating their UIs to reflect it. You can use a packet sniffer, like Wireshark or Fiddler, to verify exactly what each browser is actually transmitting to the web server.
Encoding the query-string seems to be the only answer according to this post.
Internet Explorer having problems with special chars in querystrings

Rails: need advice about slug, URL, and CJK characters

I'm currently building a multi-language application which will support at least English and Japanese.
The application must be able to have URIs such as domain.com/username-slug. While this works fine with Latin characters, it does not (or rather, it looks ugly) with Japanese characters: domain.com/三浦パン屋
I was thinking of using a random number when the username is Japanese, such as:
def generate_token
  self.slug = loop do
    # first 9 digits of a random UUID, reinterpreted as a decimal string
    random_token = SecureRandom.uuid.gsub("-", "").hex.to_s[0..8]
    break random_token unless self.class.exists?(slug: random_token)
  end
end
But I don't know if this is such a good idea. I am looking for advice from people who have already faced this issue/case. Thoughts?
Thanks
TL;DR summary:
Use UTF-8 everywhere
For URIs, percent-escape all characters except the few permitted in URLs
Encourage your customers to use browsers which behave well with UTF-8 URLs
Here's the more detailed explanation. What you are after is a system of URLs for your website which have five properties:
When displayed in the location bar of your user's browser, the URLs are legible to the user and in the user's preferred language.
When the user types or pastes the legible text in their preferred language into the location bar of their browser, the browser forms a URL which your site's HTTP server can interpret correctly.
When displayed in a web page, the URLs are legible to the user and in the user's preferred language.
When supplied as a link target in an HTML link, the URL is one which the user's web browser can correctly send to your site, and which your site's HTTP server can interpret correctly.
When your site's HTTP server receives these URLs, it passes the URL to your application in a way the application can interpret correctly.
RFC 3986 URI Generic Syntax, section 2 Characters says,
"This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters.... A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter..."
The URIs in question are http:// URIs, however, so the HTTP spec also governs. RFC 2616 HTTP/1.1, Section 3.4 Character Sets, says that the encoding (there named 'character set', for consistency with the MIME spec) is specified using MIME's charset tags.
What that boils down to is that the URIs can be in a wide variety of encodings, but you are responsible for being sure your web site code and your HTTP server agree on what encoding you will use. The HTTP protocol treats the URIs largely as opaque octet streams. In practice, UTF-8 is a good choice. It covers the entire Unicode character repertoire, it's an octet-based encoding, and it's widely supported. Percent-encoding is straightforward to add and remove, for instance with Ruby's URI::Escape module.
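For instance, percent-escaping a Japanese slug (a sketch using Ruby's CGI module, which does the same job):

require 'cgi'
slug = "三浦パン屋"
encoded = CGI.escape(slug)      # => "%E4%B8%89%E6%B5%A6%E3%83%91%E3%83%B3%E5%B1%8B"
CGI.unescape(encoded) == slug   # => true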
Let's turn next to the browser. You should find out with what browsers your users are visiting your site. Test the URL handling of these browsers by pasting in URLs with Japanese language path elements, and seeing what URL your web server presents to your Ruby code. My main browser, Firefox 16.0.2 on Mac OS X, interprets characters pasted into its location bar as UTF-8, and uses that encoding plus percent-escaping when passing the URL to an HTTP request. Similarly, when it encounters a URL for an HTTP page which has non-latin characters, it removes the percent encoding of the URL and treats the resulting octets as if they were UTF-8 encoded. If the browsers your users favour behave the same way, then UTF-8 URLs will appear in Japanese to your users.
Do your customers insist on using browsers that don't behave well with percent-encoded URLs and UTF-8 encoded URL parts? Then you have a problem. You might be able to figure out some other encoding which the browsers do work well with, say Shift-JIS, and make your pages and web server respect that encoding instead. Or, you might try encouraging your users to switch to a browser which supports UTF-8 well.
Next, let's look at your site's web pages. Your code has control over the encoding of the web pages. Links in your pages will have link text, which of course can be in Japanese, and a link target, which must be in some encoding comprehensible to your web server. UTF-8 is a good choice for the web page's encoding.
So, you don't absolutely have to use UTF-8 everywhere. The important thing is that you pick one encoding that works well in all three parts of your ecosystem: the customers' web browsers, your HTTP server, and your web site code. Your customers control one part of this ecosystem. You control the other two.
Encode your URL-paths ("username-slugs") in this encoding, then percent-escape those URLs. Author and code your pages to use this encoding. The user experience should then satisfy the five requirements above. And I predict that UTF-8 is likely to be a good encoding choice.

Rails, sending mail to an address with accented characters

I am sending emails via ActionMailer on Ruby 1.9 and Rails 3.0.
All is good, I am sending emails with accented characters in subject lines and body, without issue. My charset default is UTF-8.
However when I try to send an email to an address containing accented characters it is failing miserably. I first had errors about the email address being invalid and it needing to be fully qualified.
To get around that, I needed to specify the email address in the fully qualified format '"display name" <address>'.
It sends now, but the characters in the address appear in the mail client as =?UTF-8?Q?...., which is correct: Rails is rightly encoding my UTF-8 address into the header for me.
BUT
My mail client is not recognising this in its display, so it renders garbled on screen; the literal text =?UTF-8?Q?.... appears in the "To" field on the client.
The charset is UTF-8 and the Content-Transfer-Encoding is quoted-printable.
What am I missing? It is doing my head in!
Also, as a test, I sent an email from my Mac mail client to an address with accented characters. This renders fine in my client, however the headers are totally different... the charset is an ISO one, the transfer encoding is base64... so I am thinking I need to somehow change ActionMailer to encode my mails differently? i.e. using an ISO charset and base64 encoding to get it to play nice?
I tried this but to no avail. I am either doing it wrong or completely missing the point here. From reading the various forums and sites on this, I need to encode the header fields in a certain way, but I am failing to find answers that tell me exactly what that encoding is and, more specifically, how I can do this in Rails.
Please help! :-)
Finally solved this: if you wrap the local part of the email address in quotes and leave the domain part unquoted, it works a treat. It seems the mailer encodes the full email address if you don't wrap the local part in quotes, and hence the encoding breaks on its way to the server.
e.g.
somébody@here.com won't work
whereas
"somébody"@here.com will work
routes through fine and displays fine in all clients.
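A sketch with the mail gem (which ActionMailer 3 uses underneath); the addresses here are hypothetical:

require 'mail'
mail = Mail.new do
  from    'sender@example.com'
  to      '"somébody"@here.com'   # quoted local part, bare ASCII domain
  subject 'Test'
  body    'Hello'
end
puts mail.to_s   # inspect the generated To: header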
Currently not all mail servers support UTF-8 email addresses (aka SMTPUTF8); a lot of them will do crazy things (even malforming header content). Can you check that your encoding header made it all the way through the mail server and wasn't stripped out?
The MTA would have to support RFC 6530 to handle UTF-8 addresses, so it may not be your application's fault.

Axis2 client made with wsdl2java uses UTF-8 instead of UTF-16

I'm using Axis2 1.6.1 to create a webservice, both the server and the client. The webservice is pretty simple: it receives two strings and returns an array of bytes. The issue I'm finding is that the client is sending the request encoded as UTF-8, so when I send text in Spanish with accents, the accents get replaced by strange characters. How can I force the client to use UTF-16?
Thanks
Jose Luis
You need to set it as a service client property. Please have a look here [1].
[1] http://wso2.org/library/230

ExactTarget e-mails inconsistently rendering Korean or Chinese characters

I have a small ASP.NET MVC site that is scraping content from a client site for inclusion in an ExactTarget-generated e-mail. If a content area uses hard-coded Chinese or Korean characters, the e-mails render properly on all clients. When an area calls out to the MVC site using
%%before; httpget; 1 "http://mysite/contentarea/?parm1=One&parm2=Two"%%
the resulting HTML being sent out doesn't render consistently on all clients. Gmail handles it OK, but Yahoo and Hotmail do not. The resulting characters make it look like an encoding issue. I have the MVC site spitting out UTF-8 à la
Response.ContentEncoding = System.Text.Encoding.UTF8;
This is the first time I've really had to play with the encoding, so that may be part of my problem. :-)
I've looked at the wiki at http://wiki.memberlandingpages.com/ but it has not been much help. What I'd like to do is declare in the AMPscript that the incoming stream from the MVC site is encoded UTF-8 (or whatever). I'm assuming having things explicitly laid out should address this, but I don't know if there's something about Hotmail or Yahoo that needs to be managed somehow as well. Thanks for any help!
The only way (that I've found) to set the character encoding in Exacttarget is to request that they turn on internationalization settings for your account. Email them to let them know that you need that turned on and they should be able to sort you out soon.
Here's some documentation on it:
http://wiki.memberlandingpages.com/010_ExactTarget/020_Content/International_Sends?highlight=internationalization
You'll then have a drop-down when creating an email to specify the character encoding. I banged my head against the wall on this one for a long time before finding that documentation page. Hope that helps!
I know this is old, but just in case people are still hunting: I think that, beyond globalization of ET, the httpget function will default to WindowsCodePage 1252 encoding unless the source page's headers specify UTF-8 encoding.
http://wiki.memberlandingpages.com/010_ExactTarget/020_Content/AMPscript/AMPscript_Syntax_Guide/HTTP_AMPscript_Functions
"NOTE: ExactTarget honors any character set returned in the HTTP headers via Content-Type. For example, you can use a UTF-8 encoded HTML file with Content-Type: text/html; charset=utf-8 included in the header. If the encoding is not specified in the header, the application assumes all returned data will be in the character set WindowsCodePage 1252. You can change this default by contacting Global Support."
