I'm going to allow users to set an image with a link on my site. e.g. a profile picture and a profile link.
I will not let them upload said image, but let them give an url that i will insert into a img src.
I want to do basic checks for the best know xss patterns someone may use my site for, but thing is, i have no list of samples to check my functions works. As it is, even if I write a full RFC compliant parser to check every aspect of the URL, i will still not know what i should guard against.
I would do the following
Check that the URIs start with http:// or https://
URI encode the URI before printing it on the img src.
The first one is to make sure no javascript:, vbscript: etc. URLS are allowed. The second one is to escape any character that can cause damage (like ", ', <, > etc.).
Still a good resource for pattern, though a bit dated: http://ha.ckers.org/xss.html
Another great resource: http://html5sec.org
Scan it using the OWASP Zed Attack Proxy (ZAP).
ZAP is a free open source security tool, and its very good at finding XSS vulnerabilities.
Simon (ZAP Project Lead)
Related
I'm developing a SPA web app and it will support various languages. It is build with AngularJS and I am using angular-translate to provide i18n.
But I am struggling a little bit with how the URL structure should be. I do no plan on using either gTLDs nor ccTLDs, so that leaves me with three options.
Use query params: ?locale=en-us
Use url paths: /en-us/page
Store the chosen locale in localStorage or a cookie
The first option is a no-go according to Google's guidelines for web apps SEO. So that leaves me with the last two options.
I have a hard time deciding which is more beneficial, though I am inclined to believe that using url paths would probably be more crawler friendly.
P.S: Not sure if this is the best place to ask such a question either.
The second option is your safest bet as according to https://webmasters.stackexchange.com/questions/59652/what-happens-if-i-try-to-set-a-cookie-on-a-bot cookies are ignored. You can test this yourself by going to the Google Console and fetching your website.
As of now most crawlers ignore cookies and DO NOT execute JavaScript. This means that they usually just download the html and make their judgements from there.
Some developers get around the no javascript problem by pre-rendering parts of their content. I haven't done it personally but you might want to check out https://prerender.io/
Edit
As rolandjitsu mentioned google crawls and executes javascript content.
You should go with second option: provide the language tag (and, optionally, region subtags) in the URL path as first segment.
For the simple reason that it allows you, visitors, and bots to link to specific translations.
I came across a website with a blog post teaching all how to clear cache for web development purposes. My personal favourite one is to do /? on the end of a web address at the URL bar.
Are there any more little codes like that? if so what are they and where can I find a cheat sheet?
Appending /? may work for some URLs, but not for all.
It works if the server/site is configured in a way that, for example, http://example.com/foo and http://example.com/foo/? deliver the same document. But this is not the case for all servers/sites, and the defaults can be changed anyway.
There is no name for this. You just manipulate the canonical URL, hoping to craft a URL that points to the same document, without getting redirected.
Other common variants?
I’d expect appending ? would work even more often than /? (both, of course, only work if the URL has no query component already).
http://example.com/foo
http://example.com/foo?
You’ll also find sites that allow any number of additional slashes where only one slash used to be.
http://example.com/foo/bar
http://example.com/foo////bar
Not sure if it affects the cache, but specifying the domain as FQDN, by adding a dot after the TLD, would work for many sites, too.
http://example.com/foo
http://example.com./foo
Some sites might not have case-sensitive paths.
http://example.com/foo
http://example.com/fOo
We have built a "redirect" engine into our product so our customers can add/edit/delete custom redirects without us having to maintain a bunch of rewrite rules on the server.
Some issues are arising in the URLs that get passed into our code. We are pulling these from the CGI.QUERY_STRING property populated by Coldfusion (it picks up on 404's thrown by IIS/Coldfusion, which appends the bad URL as a query string like ?404;http://www.mysite.com:80/nonexistent-file.cfm).
What we see is that some special characters are getting an additional character thrown in there (an Â) character. Take this URL (%A9 is the copyright symbol):
http://www.mysite.com/%A9/
The CGI.QUERY_STRING is reporting this as:
http://www.mysite.com:80/©/
I have no idea where this extra "Â" is coming from. I have a hunch that this is being brought in by IIS, but it could also be with Coldfusion as it has to populate the CGI variable.
Any ideas as to why this is happening and how to fix it? It appears that not all percent-encoded/special characters do this...
EDIT:
I am probably giving up on my exact problem, however, it would be beneficial still to know why either IIS or Coldfusion is throwing in this extra character (especially for certain escape sequences over others).
Wow... that's a tough one. Usually folks design sites to use alphanumeric plus the tilde (~) and dash (=). I'm not even sure if the RFC allows for a copywrite symbol as part of the host header. I'm not positive that it should be allowed in the scheme portion of the URL. This article might shed some light on it for you. As for IIS - you might be able to add a specific rewrite rule that takes care of the issue. Personally I would avoid these characters in the schema part of the URL.
When should a trailing slash be used in a URL? For example - should my URL look like /about-us/ or like /about-us?
I am fully aware of the SEO-related issues - duplicate content and the canonical thing; I'm trying to figure out which one I should use in the context of serving pages correctly alone.
For example, my colleague is thinking that a trailing slash at the end means it's a "folder" - a "directory", so this is not a correct style. But I think that without a slash in the end - it's not quite correct either, because it almost looks like a folder, but it isn't and it's not a normal file either, but a filename without extension.
Is there a proper way of knowing which to use?
It is not a question of preference. /base and /base/ have different semantics. In many cases, the difference is unimportant. But it is important when there are relative URLs.
child relative to /base/ is /base/child.
child relative to /base is (perhaps surprisingly) /child.
In my personal opinion trailing slashes are misused.
Basically the URL format came from the same UNIX format of files and folders, later on, on DOS systems, and finally, adapted for the web.
A typical URL for this book on a Unix-like operating system would be a file path such as file:///home/username/RomeoAndJuliet.pdf, identifying the electronic book saved in a file on a local hard disk.
Source: Wikipedia: Uniform Resource Identifier
Another good source to read: Wikipedia: URI Scheme
According to RFC 1738, which defined URLs in 1994, when resources contain references to other resources, they can use relative links to define the location of the second resource as if to say, "in the same place as this one except with the following relative path". It went on to say that such relative URLs are dependent on the original URL containing a hierarchical structure against which the relative link is based, and that the ftp, http,
and file URL schemes are examples of some that can be considered hierarchical, with the components of the hierarchy being separated by "/".
Source: Wikipedia Uniform Resource Locator (URL)
Also:
That is the question we hear often. Onward to the answers! Historically, it’s common for URLs with a trailing slash to indicate a directory, and those without a trailing slash to
denote a file:
http://example.com/foo/ (with trailing slash, conventionally a directory)
http://example.com/foo (without trailing slash, conventionally a file)
Source: Google WebMaster Central Blog - To slash or not to slash
Finally:
A slash at the end of the URL makes the address look "pretty".
A URL without a slash at the end and without an extension looks somewhat "weird".
You will never name your CSS file (for example) http://www.sample.com/stylesheet/ would you?
BUT I'm being a proponent of web best practices regardless of the environment.
It can be wonky and unclear, just as you said about the URL with no ext.
I'm always surprised by the extensive use of trailing slashes on non-directory URLs (WordPress among others). This really shouldn't be an either-or debate because putting a slash after a resource is semantically wrong. The web was designed to deliver addressable resources, and those addresses - URLs - were designed to emulate a *nix-style file-system hierarchy. In that context:
Slashes always denote directories, never files.
Files may be named anything (with or without extensions), but cannot contain or end with slashes.
Using these guidelines, it's wrong to put a slash after a non-directory resource.
That's not really a question of aesthetics, but indeed a technical difference. The directory thinking of it is totally correct and pretty much explaining everything. Let's work it out:
You are back in the stone age now or only serve static pages
You have a fixed directory structure on your web server and only static files like images, html and so on — no server side scripts or whatsoever.
A browser requests /index.htm, it exists and is delivered to the client. Later you have lots of - let's say - DVD movies reviewed and a html page for each of them in the /dvd/ directory. Now someone requests /dvd/adams_apples.htm and it is delivered because it is there.
At some day, someone just requests /dvd/ - which is a directory and the server is trying to figure out what to deliver. Besides access restrictions and so on there are two possibilities: Show the user the directory content (I bet you already have seen this somewhere) or show a default file (in Apache it is: DirectoryIndex: sets the file that Apache will serve if a directory is requested.)
So far so good, this is the expected case. It already shows the difference in handling, so let's get into it:
At 5:34am you made a mistake uploading your files
(Which is by the way completely understandable.) So, you did something entirely wrong and instead of uploading /dvd/the_big_lebowski.htm you uploaded that file as dvd (with no extension) to /.
Someone bookmarked your /dvd/ directory listing (of course you didn't want to create and always update that nifty index.htm) and is visiting your web-site. Directory content is delivered - all fine.
Someone heard of your list and is typing /dvd. And now it is screwed. Instead of your DVD directory listing the server finds a file with that name and is delivering your Big Lebowski file.
So, you delete that file and tell the guy to reload the page. Your server looks for the /dvd file, but it is gone. Most servers will then notice that there is a directory with that name and tell the client that what it was looking for is indeed somewhere else. The response will most likely be be:
Status Code:301 Moved Permanently with Location: http://[...]/dvd/
So, totally ignoring what you think about directories or files, the server only can handle such stuff and - unless told differently - decides for you about the meaning of "slash or not".
Finally after receiving this response, the client loads /dvd/ and everything is fine.
Is it fine? No.
"Just fine" is not good enough for you
You have some dynamic page where everything is passed to /index.php and gets processed. Everything worked quite good until now, but that entire thing starts to feel slower and you investigate.
Soon, you'll notice that /dvd/list is doing exactly the same: Redirecting to /dvd/list/ which is then internally translated into index.php?controller=dvd&action=list. One additional request - but even worse! customer/login redirects to customer/login/ which in turn redirects to the HTTPS URL of customer/login/. You end up having tons of unnecessary HTTP redirects (= additional requests) that make the user experience slower.
Most likely you have a default directory index here, too: index.php?controller=dvd with no action simply internally loads index.php?controller=dvd&action=list.
Summary:
If it ends with / it can never be a file. No server guessing.
Slash or no slash are entirely different meanings. There is a technical/resource difference between "slash or no slash", and you should be aware of it and use it accordingly. Just because the server most likely loads /dvd/index.htm - or loads the correct script stuff - when you say /dvd: It does it, but not because you made the right request. Which would have been /dvd/.
Omitting the slash even if you indeed mean the slashed version gives you an additional HTTP request penalty. Which is always bad (think of mobile latency) and has more weight than a "pretty URL" - especially since crawlers are not as dumb as SEOs believe or want you to believe ;)
When you make your URL /about-us/ (with the trailing slash), it's easy to start with a single file index.html and then later expand it and add more files (e.g. our-CEO-john-doe.jpg) or even build a hierarchy under it (e.g. /about-us/company/, /about-us/products/, etc.) as needed, without changing the published URL. This gives you a great flexibility.
Other answers here seem to favor omitting the trailing slash. There is one case in which a trailing slash will help with search engine optimization (SEO). That is the case that your document has what appears to be a file extension that is not .html. This becomes an issue with sites that are rating websites. They might choose between these two urls:
http://mysite.example.com/rated.example.com
http://mysite.example.com/rated.example.com/
In such a case, I would choose the one with the trailing slash. That is because the .com extension is an extension for Windows executable command files. Search engines and virus checkers often dislike URLs that appear that they may contain malware distributed through such mechanisms. The trailing slash seems to mitigate any concerns, allowing the page to rank in search engines and get by virus checkers.
If your URLs have no . in the file portion, then I would recommend omitting the trailing slash for simplicity.
Who says a file name needs an extension?? take a look on a *nix machine sometime...
I agree with your friend, no trailing slash.
From an SEO perspective, choosing whether or not to include a trailing slash at the end of a URL is irrelevant. These days, it is common to see examples of both on the web. A site will not be penalized either way, nor will this choice affect your website's search engine ranking or other SEO considerations.
Just choose a URL naming convention you prefer, and include a canonical meta tag in the <head> section of each webpage.
Search engines may consider a single webpage as two separate duplicate URLS when they encounter it with and without the trailing slash, ie example.com/about-us/ and example.com/about-us.
It is best practice to include a canonical meta tag on each page because you cannot control how other sites link to your URLs.
The canonical tag looks like this: <link rel="canonical" href="https://example.com/about-us" />. Using a canonical meta tag ensures that search engines only count each of your URLs once, regardless of whether other websites include a trailing slash when they link to your site.
The trailing slash does not matter for your root domain or subdomain. Google sees the two as equivalent.
But trailing slashes do matter for everything else because Google sees the two versions (one with a trailing slash and one without) as being different URLs.
Conventionally, a trailing slash (/) at the end of a URL meant that the URL was a folder or directory.
A URL without a trailing slash at the end used to mean that the URL was a file.
Read more
Google recommendation
I feel dumb for not knowing this, but I see a lot of links in web pages and instead of this:
<a href="http://foo.com/">
...they use this:
<a href="http://foo.com/?src=bar.com">
Now I understand that the ?src= is telling something that this referral is coming from bar.com, but I don't understand why this needs to be called out explicitly. Can anyone shed some light on it for me? Is this something I need to include in my program generated links?
EDIT: Ok, sorry, I'm not being clear enough. I understand the GET syntax with a question mark and parameters separated by ampersands. I'm wondering what's this special src parameter? Why would one site link to another and tack an src parameter on the end even though there's no indication that the destination site uses this normally.
For example, on this page hover your mouse over the screenshot. The link URL is http://moms4mom.com/?src=stackexchangesites
But moms4mom.com is our site. Passing the src parameter does nothing, so why include it?
There are a few reasons that the src is being used explicitly. But in general, it is easier and more reliable to trust a query string to determine referer[sic] than it is to trust the referer, since the latter is often broken, deliberately or not. On the other hand, browsers almost never break the query string in a url, since this, unlike referers, is pretty important for pages to function. Besides, a referer is often done without any deliberate action on the part of the site doing the refering, which some users dislike.
The reason (I do it) is that popular analytics tools sometimes make it easier to filter on query strings than referrers.
There is no standard to the src parameter. Each site has its own and it's usually up to the site that gets the link to define how it wants to read it (as usually it's that site that's going to pay for the click).
The second is a dynamic link, it's a URL that another language(like ASP and PHP) interpret as something to do, like in those Google URLs, but i never used this site(foo.com), then i don't much things about this parameter.
Depending on how the site processes its URL, you may or may not need to include the ?... information.
This is passed to the website, and the server can process it just like form input. Some sites require this - and build their navigation off a single page, using nothing but the "extra" stuff passed afterwards. If you're generating a link to a site like that, it will be required.
In other cases, this is just used to pass extra, unrequired info (such as advertising, tracking info, etc)... In those cases, you can leave it off.
Unfortunately, there's no way to know without trying whether you can remove the "extra" bits from the URL.
After reading some of your comments - I'll also say:
There is nothing special about the "src" field in a query string. The server is free to use it any way it wishes. Unless you know specific info about the server, you cannot assume it can be left out.
The part after the ? is the query string. Different sites use it for different things, and it is usually used for passing information to the server side code for that URL, but can also be used in javascript.
For more info see Query String