How should authors publish (by URI) resources with names containing non-ASCII (for example national) characters?
Considering all the various parties (HTML code, the browser sending the request, the browser saving the file to disk, the server receiving and processing the request, and the server storing the file), each possibly using a different encoding, it seems nearly impossible to get this working consistently. Or at least I never managed to.
When it comes to web pages I’m already used to that and always replace national characters with corresponding Latin base characters.
But when it comes to external files (PDFs, images, …) it somehow “feels wrong” to “downgrade” the names. Especially if one expects users to save those files on disk. How to deal with this issue?
Background
I have a web application where users may upload a wide variety of files (e.g. jpg, png, r, csv, epub, pdf, docx, nb, tex, etc). Currently, we whitelist exactly which file types a user may upload. This limitation is sometimes annoying for users (i.e. because they must zip disallowed files, then upload the zip) and for us (i.e. because users write support asking for additional file types to be whitelisted).
Ideal Solution
Ideally, I'd love to whitelist files more aggressively. Specifically, I would like to (1) figure out which file types may be trusted and (2) whitelist them all. Having a larger whitelist would be more convenient for users and would reduce the number of support tickets (if only very slightly).
What I Know
I've done a few hours of research and have identified common problems (e.g. path traversal, placing assets in the root directory, .htaccess vulnerabilities, failure to validate mime type, etc). While this research has been interesting, my understanding is that many of these issues are moot (or considerably mitigated) if your assets are stored on Amazon S3 (or a similar cloud storage service) – which is how most modern web applications manage user-uploaded files.
Hasn't this question already been asked a zillion times?!
Please don't mistake this as a general "What are the security risks of user-uploaded content?" question. There are already many questions like that and I don't want to rehash that discussion here.
More specifically, my question is, "What risks, if any, exist given a conventional / modern web application setup?" In other words, I don't care about some old PHP app or vulnerabilities related to IE6. What should I be worried about assuming files are stored in a cloud service like Amazon S3?
Context about infrastructure / architecture
So... To answer that, you'll probably need more context about my setup. That said, I suspect this is a relatively common setup and therefore hope the answers will be broadly useful to anyone writing a modern web application.
My stack
Ruby on Rails application, hosted on Heroku
Users may upload a variety of files (via Paperclip)
Server validates both mime type and extension (against a whitelist)
Files are stored on Amazon S3 (with varying ACL permissions)
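The extension-plus-mime-type check from the list above can be sketched as plain Ruby. The whitelist contents and the helper name here are made up for illustration, not the app's actual code:

```ruby
# Hypothetical whitelist mapping lowercase extensions to the MIME types
# we are willing to accept for them.
ALLOWED_TYPES = {
  "jpg" => "image/jpeg",
  "png" => "image/png",
  "pdf" => "application/pdf",
  "csv" => "text/csv"
}.freeze

# Both the file extension and the declared MIME type must agree with
# the whitelist; a mismatch on either rejects the upload.
def allowed_upload?(filename, mime_type)
  ext = File.extname(filename).delete_prefix(".").downcase
  ALLOWED_TYPES[ext] == mime_type
end
```

Checking both pieces means a file named `photo.jpg` with a declared type of `text/javascript` is rejected, even though its extension alone would pass.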
When a user uploads a file...
I upload the file directly to S3 into a tmp folder (it hasn't touched my server yet)
My server then downloads the file from the tmp folder on S3.
Paperclip runs validations and executes any processing (e.g. cutting thumbnails of images)
Finally, Paperclip places the file(s) back on S3 in their new, permanent location.
When a user downloads a file...
User clicks to download a file, which sends a request to my API (e.g. /api/article/123/download)
Internally, my API reads the file from S3 and then serves it to the user (as content type attachment)
Thus the file does briefly pass through my server (i.e. it is not merely a redirect)
From the user's perspective, the file is served from my API (i.e. the user has no idea the files live on S3)
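The serving step above boils down to sending the bytes with an attachment disposition. A minimal sketch of the response headers involved (the helper name and filename are illustrative, not the app's real code):

```ruby
# Build the response headers for serving a stored file as a download.
# Content-Disposition: attachment makes the browser save the file
# rather than try to render it in the page.
def attachment_headers(filename, content_type)
  {
    "Content-Type"        => content_type,
    "Content-Disposition" => %(attachment; filename="#{filename}")
  }
end
```

In Rails this is what `send_data`/`send_file` with `disposition: "attachment"` produce on your behalf.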
Questions
Given this setup, is it safe to whitelist a wide range of file types?
Are there some types of files that are always best avoided (e.g. JS files)?
Are there any glaring flaws in my setup? I suspect not, but if so, please alert me!
I'm actually building a multi-language application which will support at least English and Japanese.
The application must be able to have URIs such as domain.com/username-slug. While this works fine with Latin characters, it does not (or rather, it looks ugly) with Japanese characters: domain.com/三浦パン屋
I was thinking of using a random token when the username is Japanese, such as:
def generate_token
  self.slug = loop do
    # strip the dashes from a UUID, parse the hex digits as an integer,
    # and keep the first nine decimal digits as the token
    random_token = SecureRandom.uuid.gsub("-", "").hex.to_s[0..8]
    # retry until the token doesn't collide with an existing slug
    break random_token unless self.class.exists?(slug: random_token)
  end
end
But I don't know if this is such a good idea. I am looking for advice from people who have already faced this issue/case. Thoughts?
Thanks
TL;DR summary:
Use UTF-8 everywhere
For URIs, percent-escape all characters except the few permitted in URLs
Encourage your customers to use browsers which behave well with UTF-8 URLs
Here's the more detailed explanation. What you are after is a system of URLs for your website which have five properties:
When displayed in the location bar of your user's browser, the URLs are legible to the user and in the user's preferred language.
When the user types or pastes the legible text in their preferred language into the location bar of their browser, the browser forms a URL which your site's HTTP server can interpret correctly.
When displayed in a web page, the URLs are legible to the user and in the user's preferred language.
When supplied as a link target in an HTML link, the link forms a URL which the user's web browser can correctly send to your site, and which your site's HTTP server can interpret correctly.
When your site's HTTP server receives these URLs, it passes the URL to your application in a way the application can interpret correctly.
RFC 3986 URI Generic Syntax, section 2 Characters says,
This specification does not mandate any particular character encoding
for mapping between URI characters and the octets used to store or
transmit those characters.... A percent-encoding mechanism is used to
represent a data octet in a component when that octet's corresponding
character is outside the allowed set or is being used as a
delimiter...
The URIs in question are http:// URIs, however, so the HTTP spec also governs. RFC 2616 HTTP/1.1, Section 3.4 Character Sets, says that the encoding (there named 'character set', for consistency with the MIME spec) is specified using MIME's charset tags.
What that boils down to is that the URIs can be in a wide variety of encodings, but you are responsible for being sure your web site code and your HTTP server agree on what encoding you will use. The HTTP protocol treats the URIs largely as opaque octet streams. In practice, UTF-8 is a good choice. It covers the entire Unicode character repertoire, it's an octet-based encoding, and it's widely supported. Percent-encoding is straightforward to add and remove, for instance with the methods in Ruby's URI module.
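For example, Ruby's standard library can do the UTF-8 percent-encoding round trip (the slug here is just an illustration):

```ruby
require "uri"

slug = "三浦パン屋"                              # a UTF-8 string
# Each UTF-8 octet outside the unreserved set becomes a %XX escape.
encoded = URI.encode_www_form_component(slug)
# Decoding reverses the escapes and yields the original UTF-8 string.
decoded = URI.decode_www_form_component(encoded)
```

Note that `encode_www_form_component` targets query components (it encodes a space as `+`); for path segments a percent-only encoder such as `ERB::Util.url_encode` is the safer choice.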
Let's turn next to the browser. You should find out with what browsers your users are visiting your site. Test the URL handling of these browsers by pasting in URLs with Japanese language path elements, and seeing what URL your web server presents to your Ruby code. My main browser, Firefox 16.0.2 on Mac OS X, interprets characters pasted into its location bar as UTF-8, and uses that encoding plus percent-escaping when passing the URL to an HTTP request. Similarly, when it encounters a URL for an HTTP page which has non-latin characters, it removes the percent encoding of the URL and treats the resulting octets as if they were UTF-8 encoded. If the browsers your users favour behave the same way, then UTF-8 URLs will appear in Japanese to your users.
Do your customers insist on using browsers that don't behave well with percent-encoded URLs and UTF-8 encoded URL parts? Then you have a problem. You might be able to figure out some other encoding which the browsers do work well with, say Shift-JIS, and make your pages and web server respect that encoding instead. Or, you might try encouraging your users to switch to browsers which support UTF-8 well.
Next, let's look at your site's web pages. Your code has control over the encoding of the web pages. Links in your pages will have link text, which of course can be in Japanese, and a link target, which must be in some encoding comprehensible to your web server. UTF-8 is a good choice for the web page's encoding.
So, you don't absolutely have to use UTF-8 everywhere. The important thing is that you pick one encoding that works well in all three parts of your ecosystem: the customers' web browsers, your HTTP server, and your web site code. Your customers control one part of this ecosystem. You control the other two.
Encode your URL-paths ("username-slugs") in this encoding, then percent-escape those URLs. Author and code your pages to use this encoding. The user experience should then satisfy the five requirements above. And I predict that UTF-8 is likely to be a good encoding choice.
I'm trying to build an iPad app to download and display documents (pdf, ppt, doc, etc.) from a web server.
Currently it does this by parsing the HTML structure (using hpple) on the server.
For example, the files are held at:
http://myserver.com/myFolders/myFiles/
The app goes to this location and traverses the tree, using an X-Path query, e.g.
"/html/body/ul/li/a"
It then downloads whatever documents it finds to the iPad for display.
So far this works quite well, but the server is publicly accessible.
My question is, how would I go about doing something similar with a secure server?
e.g. is it possible to password protect the server, connect to it with username/password from the iPad and use the same system?
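One common shape for this is HTTP Basic authentication: the server is configured to require credentials, and the client attaches them to every request. A sketch in Ruby, just to illustrate the mechanism (the host and credentials are made up; on the iPad the equivalent is an Authorization header on the NSURLRequest):

```ruby
require "net/http"
require "uri"

uri = URI("http://myserver.com/myFolders/myFiles/")
request = Net::HTTP::Get.new(uri)
# Attach hypothetical credentials; this sets the Authorization header
# to "Basic " plus base64("user:secret").
request.basic_auth("user", "secret")
# Actually sending it would look like:
#   Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

The same credentials then gate both the directory listing the app parses and the documents it downloads.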
In the end I decided not to parse the HTML as there seemed to be no straightforward way to do so. Instead the documents are held on an ASP.Net server with authentication required for access.
It would've been nice to know how to do so by traversing HTML but no biggie.
The tutorials I'm reading say to do that, but none of the websites I use do it. Why not?
none of the websites I use [put .htm into urls] Why not?
The simple answer would be:
Most sites offer dynamic content instead of static html pages.
Longer answer:
The file extension doesn't matter. It's all about the web server configuration.
The web server checks the extension of the file, so it knows how to handle it (send .html straight to the client, run .php through mod_php to generate an HTML page, etc.). This is configurable.
The web server then sends the content (static or generated) to the client; as part of the HTTP protocol, the headers tell the client the type of the content before the page itself is sent.
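As a tiny Ruby sketch of that last step, here is a Rack-style response (purely illustrative): the content type travels in a header, independent of whatever extension, if any, appeared in the URL.

```ruby
# A minimal Rack-style application: a callable that returns
# [status, headers, body]. The Content-Type header is what tells the
# client how to interpret the body.
app = lambda do |env|
  [200, { "Content-Type" => "text/html; charset=utf-8" }, ["<h1>Hello</h1>"]]
end
```

A request to `/about` and a request to `/about.html` could both hit this same code and get identical responses.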
By the way, .htm is no longer needed. We don't use DOS with 8.3 filenames anymore.
To make it even more complicated: :-)
The web server can also do URL rewriting. For example, it could map all URLs of the form www.foo.com/photos/[imagename] to an actual script located at www.foo.com/imgview.php?image=[imagename]
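That rewrite can be sketched as a plain Ruby function (server rewrite rules are config, not code, so this is only the moral equivalent; the paths mirror the example above):

```ruby
# Map a pretty photo URL onto the script that actually serves it;
# any other path passes through unchanged.
def rewrite(path)
  if (m = path.match(%r{\A/photos/([^/]+)\z}))
    "/imgview.php?image=#{m[1]}"
  else
    path
  end
end
```

The visitor only ever sees `/photos/sunset.jpg`; the script name stays an internal detail.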
The .htm extension is an abomination left over from the days of 8.3 file name length limitations. If you're writing HTML, it's more properly stored in a .html file. Bear in mind that a URL that you see in your browser doesn't necessarily correspond directly to some file on the server, which is why you rarely see .html or .htm in anything other than static sites.
I presume you're reading tutorials on creating static HTML web pages. Most sites are dynamically generated by programs that use the URL to determine the content you see. The URL is not tied to a file. If no such dynamic programs are present, then files and URLs are synonymous.
If you can, leave off the .htm (or any file extension). It adds nothing to the use of the site, and exposes an irrelevant detail in the URL.
There's no need to put .htm in your URLs. Not only does it expose an unnecessary backend detail about your site, it also means that there is less room in your URLs for other characters.
It's true that URLs can be insanely long... but if you email a long link, it will often break. Not everyone uses TinyURL and the like, so it makes sense to keep your URLs short enough that they don't get truncated in emails. Those four characters (.htm) might make the difference between your emailed URL getting truncated or not!
I am designing a web application which is a tie-in to my iPhone application. It sends massively large URLs to the web server (about 15,000 characters). I was using NearlyFreeSpeech.net, but they only support URLs up to 2,000 characters. I was wondering if anybody knows of web hosting that will support really large URLs? Thanks, Isaac
Edit: My program needs to open a picture in Safari. I could do this 2 ways:
Send it base64-encoded in the URL and just echo the query parameters.
First POST it to the server from my application; the server stores the photo in a database and sends back a unique ID, which I append to a URL that I open in Safari; that URL retrieves the photo from the database and then deletes it.
You see, I am lazy, and I know Mobile Safari can support URIs up to 80,000 characters, so I think this is an OK way to do it. If there is something really wrong with this, please tell me.
Edit: I ended up doing it the proper POST way. Thanks.
If you're sending 15,000-character-long URLs, in all likelihood you're doing it wrong.
Use something like an HTTP POST instead.
The limitations you're running up against aren't so much an issue with the hosts - it's more the fact that web servers have a limit for the length of a URL. According to this page, Apache limits you to around 4k characters, and IIS limits you to 16k by default.
Although it's not directly answering your question, and there is no official maximum length of a URL, browsers and servers have practical limits - see http://www.boutell.com/newfaq/misc/urllength.html for some details. In short, since IE (at least some versions in use) doesn't support URLs over 2,083 characters, it's probably wise to stay below that length.
If you need to just open it in Safari, and the server doesn't need to be involved, why not use a data: URI?
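A data: URI embeds the content itself, so no server round trip is needed at all. A sketch of building one in Ruby (the bytes here are a stand-in for real PNG data):

```ruby
require "base64"

# Stand-in for real image bytes read from the camera or disk.
image_bytes = "fake-image-bytes"
# A data: URI carries the MIME type and the base64-encoded payload
# inline; opening it in Safari renders the content directly.
data_uri = "data:image/png;base64,#{Base64.strict_encode64(image_bytes)}"
```

The trade-off is the same length concern as the original plan: the base64 payload inflates the data by about a third, and the whole thing still has to fit within the browser's URI limit.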
Sending long URIs over the network is basically never the right thing to do. As you noticed, some web hosts don't support long URIs. Some proxy servers may also choke on long URLs, which means that your app might not work for users who are behind those proxies. If you ever need to port your app to a different browser, other browsers may not support URIs that long.
If you need to get data up to a server, use a POST. Yes, it's an extra round trip, but it will be much more reliable.
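Building such a POST with Ruby's standard library looks like this (the endpoint and payload are hypothetical): the data rides in the request body, so URL length limits never come into play.

```ruby
require "net/http"
require "uri"

uri = URI("https://example.com/api/photos")   # hypothetical endpoint
photo_bytes = "binary-photo-data"             # stand-in for the image data

request = Net::HTTP::Post.new(uri)
request["Content-Type"] = "application/octet-stream"
request.body = photo_bytes                    # payload goes in the body, not the URL
# Sending it would be:
#   Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(request) }
```

The server responds with the unique ID, and only that short ID needs to appear in the URL handed to Safari.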
Also, if you are uploading data to the server using a GET request, then you are vulnerable to all kinds of cross-site request forgery attacks; basically, an attacker can trick the user into uploading, say, goatse to their account simply by getting them to click on a link (perhaps hidden by TinyURL or another URL shortening service, or just embedded as a link in a web page when they don't look closely at the URL they're clicking on).
You should never use GET for sending data to the server, beyond query parameters that don't actually change anything on the server.