url encode & url escape & url rewrite, what's the differences? - url

It's kinda confusing to differenciate those three terms.
It'll be more understandable if you can explain with examples.

Url encoding and Url escaping are one and the same..
URL Encoding is a process of transforming user input to a CGI form so it is fit for travel across the network; basically, stripping spaces and special characters present in the url, replacing them with escape characters.
URL rewriting changes the way you normally associate urls with resources. Normally, test.com/aboutus makes us think that it will take us to the about us page. But internally, Server may take user 1 to /aboutus/page1.html, user 2 to /aboutus/page2.html or any other resource. The Url exposed to the end user will be test.com/aboutus but the resource being rendered can be different. Note that Url Rewriting is performed by Server.

Related

Language specific characters in URL

Colleagues from work have created API endpoint which uses language specific characters in url. This api url looks like
http://somedomain.com/someapi/somemethod/zażółć/gęślą/jaźń
Is this OK or is it a bad approach?
Technically, that's not a valid URL but web browsers and other clients finesse it. The script that characters are from is not an issue but structural characters like "/?#" could be. You'll have to consider what to do when they show up in data that you are "pasting" into your URLs.
An HTTP URL is:
an ASCII-encoded scheme (in this case the protocol "http")
a punycode-encoded, ASCII-encoded domain
a %-encoded, ASCII-encoded, server-defined sequence of octets for the path, optional query, and optional hash.
See RFC 3986
The assumption that everyone makes—quite reasonably because it is the predominant practice—is that the path, query, and hash are text. There is no text but encoded text. So, some character encoding is involved. Where %-encoding is needed outside of structural characters, browsers are going to assume UTF-8. If you don't want browsers to do the %-encoding, use valid URLs by doing it yourself with the character encoding that you are using.
As the world is standardizing on UTF-8 (where applicable), the HTML DOM has also with the encodeURIComponent function. Clients using JavaScript in a web browser are likely to use this function, either directly or through some library.
UTF-8 encoded, %-encoded (and, then on the wire, ASCII-encoded) version of your URL that my browser created:
http://somedomain.com/someapi/somemethod/za%C5%BC%C3%B3%C5%82%C4%87/g%C4%99%C5%9Bl%C4%85/ja%C5%BA%C5%84
(You can see this yourself using your browser's dev tools [F12 key, network tab] or a packet sniffer [e.g., Wireshark or Fiddler]. What you gave as a URL is never seen on the wire.)
Your server application probably understands that just fine. In any case, it is your server's rules that the client complies with. If your API uses UTF-8 encoded, %-encoded URLs then just document that. (But phrase it in a way that doesn't confuse people who do that already without knowing.)

safe character to separate multiple urls

I am preparing a special string, in which keys are values are concatenated like below:
username=foo&age=24&email=foo#bar.com&homepage=http://foo.com
& is the separator for two key=value pairs
value is url encoded
I have a scenario where there are multiple home pages for a user.
I want to specify multiple urls for the homepage key
name=foo&age=24&email=foo#bar.com&homepage=url1<some_safe_url_separator_char>url2<some_safe_url_separator_char>url3
We have no control/idea over what url1, url2, .. may contain?
What is a good choice of some_safe_url_separator_char?
In other words I am not looking for a safe character to be used IN a url, but a safe character to be used to SEPARATE two urls in a string
well you can use URL re-writing for this .
It will make a URL that will be safe as it will hide the name of parameters
For refrence you can use URL rewriting
URL rewriting will make a url seprated by '/' and its tough to be decoded by an external person.
you can follow links i'm posting
URL rewriting for beginners

How do SO URLs self correct themselves if they are mistyped?

If an extra character (like a period, comma or a bracket or even alphabets) gets accidentally added to URL on the stackoverflow.com domain, a 404 error page is not thrown. Instead, URLs self correct themselves & the user is led to the relevant webpage.
For instance, the extra 4 letters I added to the end of a valid SO URL to demonstrate this would be automatically removed when you access the below URL -
https://stackoverflow.com/questions/194812/list-of-freely-available-programming-booksasdf
I guess this has something to do with ASP.NET MVC Routing. How is this feature implemented?
Well, this is quite simple to explain I guess, even without knowing the code behind it:
The text is just candy for search engines and people reading the URL:
This URL will work as well, with the complete text removed!
The only part really important is the question ID that's also embedded in the "path".
This is because EVERYTHING after http://stackoverflow.com/questions/194812 is ignored. It is just there to make the link, if posted somewhere, if more speaking.
Internally the URL is mapped to a handler, e.g., by a rewrite, that transforms into something like: http://stackoverflow.com/questions.php?id=194812 (just an example, don't know the correct internal URL)
This also makes the URL search engine friendly, besides being more readable to humans.

ASP.NET MVC Colon in URL

I've seen that IIS has a problem with letting colons into URLs. I also saw the suggestions others offered here.
With the site I'm working on, I want to be able to pass titles of movies, books, etc., into my URL, colon included, like this:
mysite.com/Movie/Bob:The Return
This would be consumed by my MovieController, for example, as a string and used further down the line.
I realize that a colon is not ideal. Does anyone have any other suggestions? As poor as it currently is, I'm doing a find-and-replace from all colons (:) to another character, then a backwards replace when I want to consume it on the Controller end.
I resolved this issue by adding this to my web.config:
<httpRuntime requestPathInvalidCharacters=""/>
This must be within the system.web section.
The default is:
<httpRuntime requestPathInvalidCharacters="<,>,*,%,&,:,\,?"/>
So to only make an exception for the colon it would become
<httpRuntime requestPathInvalidCharacters="<,>,*,%,&,\,?"/>
Read more at: http://msdn.microsoft.com/en-us/library/system.web.configuration.httpruntimesection.requestpathinvalidcharacters.aspx
For what I understand the colon character is acceptable as an unencoded character in an URL. I don't know why they added it to the default of the requestPathInvalidCharacters.
Consider URL encoding and decoding your movie titles.
You'd end up with foo.com/bar/Bob%58The%20Return
As an alternative, consider leveraging an HTML helper to remove URL unfriendly characters in URLs (method is URLFriendly()). The SEO benefits between a colon and a placeholder (e.g. a dash) would likely be negligable.
One of the biggest worries with your approach is that the movie name isn't always going to be unique (e.g. "The Italian Job"). Also what about other ilegal characters (e.g. brackets etc).
It might be a good idea to use an id number in the url to locate the movie in your database. You could still include a url friendly copy of movie name in your url, but you wouldn't need to worry about getting back to the original title with all the illegal characters in it.
A good example is the url to this page. You can see that removing the title of the page still works:
ASP.NET MVC Colon in URL
ASP.NET MVC Colon in URL
Colon is a reserved and invalid character in an URI according to the RFC 3986. So don't do something that violates the specification. You need to either URL encode it or use another character. And here's a nice blog post you might take a look at.
The simplest way is to use System.Web.HttpUtility.UrlEncode() when building the url
and System.Web.HttpUtility.UrlDecode when interpreting the results coming back. You would also have problems with the space character if you don't encode the value first.

Why we don't use such URL formats?

I am reworking on the URL formats of my project. The basic format of our search URLs is this:-
www.projectname/module/search/<search keyword>/<exam filter>/<subject filter>/... other params ...
On searching with no search keyword and exam filter, the URL will be :-
www.projectname/module/search///<subject filter>/... other params ...
My question is why don't we see such URLs with back to back slashes (3 slashes after www.projectname/module/search)? Please note that I am not using .htaccess rewrite rules in my project anymore. This URL works perfect functionally. So, should I use this format?
For more details on why we chose this format, please check my other question:-
Suggest best URL style
Web servers will typically remove multiple slashes before the application gets to see the request,for a mix of compatibility and security reasons. When serving plain files, it is usual to allow any number of slashes between path segments to behave as one slash.
Blank URL path segments are not invalid in URLs but they are typically avoided because relative URLs with blank segments may parse unexpectedly. For example in /module/search, a link to //subject/param is not relative to the file, but a link to the server subject with path /param.
Whether you can see the multiple-slash sequences from the original URL depends on your server and application framework. In CGI, for example (and other gateway standards based on it), the PATH_INFO variable that is typically used to implement routing will usually omit multiple slashes. But on Apache there is a non-standard environment variable REQUEST_URI which gives the original form of the request without having elided slashes or done any %-unescaping like PATH_INFO does. So if you want to allow empty path segments, you can, but it'll cut down on your deployment options.
There are other strings than the empty string that don't make good path segments either. Using an encoded / (%2F), \ (%5C) or null byte (%00) is blocked by default by many servers. So you can't put any old string in a segment; it'll have to be processed to remove some characters (often ‘slug’-ified to remove all but letters and numbers). Whilst you are doing this you may as well replace the empty string with _.
Probably because it's not clearly defined whether or not the extra / should be ignored or not.
For instance: http://news.bbc.co.uk/sport and http://news.bbc.co.uk//////////sport both display the same page in Firefox and Chrome. The server is treating the two urls as the same thing, whereas your server obviously does not.
I'm not sure whether this behaviour is defined somewhere or not, but it does seem to make sense (at least for the BBC website - if I type an extra /, it does what I meant it to do.)

Resources