Can someone explain to me the following behavior:
When I copy a specific URL from Firefox and paste it into Notepad++ (or Stack Overflow), some parameters change. I can't post the original URL, but it's something like this:
In the address bar:
https://xxx.xxx-xxxx.de/xxxxx//xxxxx?project=xxxxl&query=xxxx&keyname=OBJNAME&keyvalue=05-(G)28-01-008
But Notepad++ and Stack Overflow show me this:
https://xxx.xxx-xxxx.de/xxxxx//xxxxx?project=xxxx&query=xxxxx&keyname=OBJNAME&keyvalue=05%2D%28G%2928%2D01%2D008
This URI is percent-encoded.
See the URI standard: Percent-Encoding. There are various reasons why something could be percent-encoded (sometimes it’s required, sometimes optional; sometimes it changes the meaning, sometimes not).
Some browsers display the URL after decoding the percent-encoded parts. This is typically done for usability reasons: it’s nicer to see https://en.wikipedia.org/wiki/Ö instead of https://en.wikipedia.org/wiki/%C3%96.
Copy-pasting the URL from the address bar to some other place is one way to see the actual URL (with percent-encoding).
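If you want to see the two forms yourself, a quick browser-console sketch (using the Wikipedia example above) shows the round trip:

    // encodeURI() percent-encodes the UTF-8 bytes of non-ASCII characters;
    // decodeURIComponent() reverses it.
    const wire = encodeURI("https://en.wikipedia.org/wiki/Ö");
    console.log(wire);                       // -> https://en.wikipedia.org/wiki/%C3%96
    console.log(decodeURIComponent(wire));   // -> https://en.wikipedia.org/wiki/Ö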
When I copy a UTF-8 URL from the browser's address bar (almost any browser on any OS) and then paste it into another text field (to post it on Facebook or Twitter, for example), I get only the percent-encoded URL, which is ugly. For example, in the address bar the URL appears like this:
https://www.chaino.com/وذكر
But copying and pasting it anywhere else gives the following ugly URL:
https://www.chaino.com/%D9%88%D8%B0%D9%83%D8%B1
And when I want to get the original URL back to use somewhere, I have to decode it with this Raw URL Decoder - Online Tool.
The question is: is there a short, direct way to copy these kinds of URLs and paste them without this hideous process? (Maybe using Chrome extensions or something.)
You can add a space at the end of the URL in the address bar; then you can select it all and copy it directly.
You can select the URL without the scheme (e.g. http://) and copy it. This will give you what you expected.
P.S. The point is to select only part of the link. E.g. you can select the whole URL except its first character and then add that character back manually after pasting.
In Firefox 53+ you can set the browser.urlbar.decodeURLsOnCopy option in about:config to true.
The URI you get by copying from the address bar is the only valid URI the browser can give you.
From RFC 3986 (and other URL RFCs):
A URI is a sequence of characters from a very limited set: the letters of the basic Latin alphabet, digits, and a few special characters.
So: https://www.chaino.com/وذكر
is an invalid URI, yet a valid IRI (Internationalized Resource Identifier), which your browser converts to a valid URI when requesting the page from the server over HTTP (HTTP does not allow IRIs, only URIs).
TL;DR: Your browser is giving you what you should expect: a valid URI that you can use everywhere, not an IRI that is only supported here and there.
P.S. If "facebook or twitter for example" are kind, they may display a readable form to their users, so don't worry about giving them the encoded form.
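As an illustration of that conversion (a quick sketch, not part of the original answer), the WHATWG URL constructor available in modern browsers and Node.js performs the same IRI-to-URI encoding the browser does before sending a request:

    // Non-ASCII path characters are percent-encoded as their UTF-8 bytes.
    const iri = "https://www.chaino.com/وذكر";
    console.log(new URL(iri).href);
    // -> https://www.chaino.com/%D9%88%D8%B0%D9%83%D8%B1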
You can use Chrome extensions like the ones below:
https://chrome.google.com/webstore/detail/copy-unicode-urls/fnbbfiapefhkicjhecnoepbijhanpkjp
https://chrome.google.com/webstore/detail/copy-cyrilic-urls/alnknpnpinldhpkjkgobkalmoaeolhnf
Create a bookmark with this URL: javascript:console.log(prompt('copy (Control+C) this link:', decodeURIComponent(window.location))).
Click this bookmark on the page in question; the decoded URL appears pre-selected in a prompt so you can copy it. (The console.log wrapper prevents the browser from navigating to the string the prompt returns.)
Example page: https://www.google.com.hk/search?q=中文
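If your browser supports the asynchronous Clipboard API, a variant of this bookmarklet (a sketch, not from the original answer) can skip the prompt and copy the decoded URL directly; note that clipboard access generally requires an HTTPS page:

    javascript:void navigator.clipboard.writeText(decodeURIComponent(location.href))

The void keeps the javascript: URL from evaluating to a value the browser might navigate to.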
The best answer I have found till now is using this Chrome extension:
https://chrome.google.com/webstore/detail/copy-cyrilic-urls/alnknpnpinldhpkjkgobkalmoaeolhnf?hl=en-US
which enables me to copy the URL (in a decoded state) with only one click :)
You can use the Chrome and Firefox extension called "Copy Unicode URLs", which I created.
It is:
Open source.
Able to leave URL terminators encoded, so that, e.g., links ending with a dot keep that dot encoded and email clients won't wrongly recognize the dot as a sentence/URL terminator.
If you love my work, please donate here.
Copy the address without the leading 'h' of http...
Then paste the address (still without the 'h') and add the 'h' back at the front manually.
My crawler engine seems to have a problem with a specific customer's site.
At that site, there are redirects to URLs that look like this:
http://example.com/dir/aaa$0081 aaa.php
(The URL is shown non-encoded, with $0081 being two bytes represented in hex.)
Now, this is when inspecting the buffer returned after using the WinInet Windows API call HttpQueryInfo, so the two bytes actually represent a WideChar at this point.
Now, I can see that e.g. $0081 is a non-visual control character:
Latin-1 Supplement (Unicode block)
The problem is that if I use the URL "as-is" (URL-encoded) for future requests to the server, it responds with 400 or 404. (On the other hand, if the character is removed entirely, it works and the server delivers the correct page and response...)
I suspect that Firefox/IE/etc. strip non-visible control characters from URLs before making the HTTP requests... (At least the IEHTTPHeaders and Firefox Live HTTP Headers add-ins don't show any non-visible characters.)
I was wondering if anyone can point me to a standard for this. From what I can see, non-visible characters should not appear in URLs, so I am thinking a solution might be (in this and future cases) to remove them. But it is not a topic that seems widely discussed on the net.
In the example given, $0081 is just five ASCII characters. But if you mean that this is just what it looks like and you have (somehow) inferred that the actual URL contains U+0081, then what should happen, and does happen at least in Firefox, is that it is %-encoded (“URL encoded”) as %C2%81 (formed by %-encoding the two bytes of the UTF-8 encoded form of U+0081). Firefox shows this as empty in its address bar, since U+0081 is a control character, but the server actually gets %C2%81 and must take it from there.
I have no idea of where the space comes from, but a URL must not contain a space, except as %-encoded (%20).
The relevant standard is Internet-standard STD 66, URI Generic Syntax. (Currently RFC 3986. Beware: people still often refer to older RFCs as “standard” in this issue.)
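To check the encoding yourself (a quick console sketch, not from the original answer; stripping control characters is one defensive option for a crawler, not a mandated behavior):

    // U+0081 is a C1 control character; %-encoding its UTF-8 form gives
    // the two bytes the server actually receives.
    console.log(encodeURIComponent("\u0081"));   // -> %C2%81

    // One option: strip C0/C1 control characters from a path before
    // re-requesting it.
    const cleaned = "aaa\u0081 aaa.php".replace(/[\u0000-\u001F\u007F-\u009F]/g, "");
    console.log(encodeURI(cleaned));             // -> aaa%20aaa.php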
If an extra character (like a period, comma, or bracket, or even letters) gets accidentally added to a URL on the stackoverflow.com domain, a 404 error page is not thrown. Instead, the URL corrects itself and the user is led to the relevant webpage.
For instance, the four extra letters I added to the end of a valid SO URL to demonstrate this are automatically removed when you access the URL below:
https://stackoverflow.com/questions/194812/list-of-freely-available-programming-booksasdf
I guess this has something to do with ASP.NET MVC Routing. How is this feature implemented?
Well, this is quite simple to explain I guess, even without knowing the code behind it:
The text is just candy for search engines and people reading the URL:
This URL will work as well, with the complete text removed!
The only part really important is the question ID that's also embedded in the "path".
This is because EVERYTHING after http://stackoverflow.com/questions/194812 is ignored. It is just there to make the link more meaningful if posted somewhere.
Internally the URL is mapped to a handler, e.g. by a rewrite, that transforms it into something like http://stackoverflow.com/questions.php?id=194812 (just an example; I don't know the actual internal URL).
This also makes the URL search engine friendly, besides being more readable to humans.
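A minimal sketch of the idea, written here in Express 4-style JavaScript rather than the ASP.NET MVC routing the question mentions (findQuestionById and renderQuestion are hypothetical helpers):

    const express = require("express");
    const app = express();

    // Only the numeric ID matters; the trailing slug is optional cosmetics.
    app.get("/questions/:id/:slug?", (req, res) => {
      const question = findQuestionById(req.params.id); // hypothetical lookup
      if (!question) return res.sendStatus(404);

      // Missing or wrong slug? Redirect to the canonical URL.
      if (req.params.slug !== question.slug) {
        return res.redirect(301, `/questions/${question.id}/${question.slug}`);
      }
      res.send(renderQuestion(question)); // hypothetical renderer
    });

    app.listen(3000);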
I am reworking the URL formats of my project. The basic format of our search URLs is this:
www.projectname/module/search/<search keyword>/<exam filter>/<subject filter>/... other params ...
When searching with no search keyword and no exam filter, the URL will be:
www.projectname/module/search///<subject filter>/... other params ...
My question is: why don't we see such URLs with back-to-back slashes (three slashes after www.projectname/module/search)? Please note that I am not using .htaccess rewrite rules in my project anymore. This URL works perfectly well functionally. So, should I use this format?
For more details on why we chose this format, please check my other question:
Suggest best URL style
Web servers will typically remove multiple slashes before the application gets to see the request, for a mix of compatibility and security reasons. When serving plain files, it is usual to allow any number of slashes between path segments to behave as one slash.
Blank URL path segments are not invalid in URLs, but they are typically avoided because relative URLs with blank segments may parse unexpectedly. For example, in /module/search, a link to //subject/param is not relative to the current path but a link to the host subject with the path /param.
Whether you can see the multiple-slash sequences from the original URL depends on your server and application framework. In CGI, for example (and other gateway standards based on it), the PATH_INFO variable that is typically used to implement routing will usually omit multiple slashes. But on Apache there is a non-standard environment variable, REQUEST_URI, which gives the original form of the request without having collapsed slashes or done any %-unescaping the way PATH_INFO does. So if you want to allow empty path segments, you can, but it will cut down on your deployment options.
There are strings other than the empty string that don't make good path segments either. An encoded / (%2F), \ (%5C) or null byte (%00) is blocked by default by many servers. So you can't put just any string in a segment; it will have to be processed to remove some characters (often ‘slug’-ified to remove all but letters and numbers). While you are doing this, you may as well replace the empty string with _.
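A minimal sketch of that slug-and-placeholder approach (the exact character rules are a design choice, not a standard):

    // Reduce a keyword to a safe path segment; fall back to "_" when empty.
    function toPathSegment(keyword) {
      const slug = keyword
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, "-")  // keep only letters and digits
        .replace(/^-+|-+$/g, "");     // trim leading/trailing dashes
      return slug === "" ? "_" : slug;
    }

    console.log(toPathSegment("C++ / C#"));  // -> "c-c"
    console.log(toPathSegment(""));          // -> "_"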
Probably because it's not clearly defined whether the extra / should be ignored.
For instance, http://news.bbc.co.uk/sport and http://news.bbc.co.uk//////////sport both display the same page in Firefox and Chrome. That server treats the two URLs as the same thing, whereas your server obviously does not.
I'm not sure whether this behaviour is defined somewhere, but it does seem to make sense (at least for the BBC website): if I type an extra /, it does what I meant it to do.
Currently, when I try to hit certain pages of my site via something like http://www.domain.com/< (a URL my application should accept), I get a blank page with the text "Bad Request" on it (and nothing else). This happens with both the escaped and unescaped versions of the URL.
I'm fairly certain this is because IIS6 does not like the < character (which is valid when percent-encoded as %3C). Is there a way to stop IIS6 from filtering these characters and giving me this error page?
(I've found similar solutions for IIS7, but nothing has worked in IIS6 so far.)
UPDATE: The URL is already being transformed, i.e. hitting domain.com/%3C also gives the "Bad Request" page.
Not sure if this will work, but a similar trick got me out of a jam caused by designers forgetting key parts of query strings, and it sounds like you might have a similar issue. Try making a virtual directory called %3c and then have it redirect where appropriate.
RFC 1738:
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
< transforms to %3C
https://stackoverflow.com/<
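To confirm that transformation yourself (a quick console check, not part of the original answer; the WHATWG URL parser applies the same path encoding):

    console.log(encodeURIComponent("<"));                      // -> %3C
    console.log(new URL("https://stackoverflow.com/<").href);
    // -> https://stackoverflow.com/%3C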