Prevent XSS attacks and still use Html.Raw - asp.net-mvc

I have a CMS system where I am using CKEditor to enter data. Now if a user types in <script>alert('This is a bad script, data');</script>, CKEditor does a fair job and encodes it correctly, passing &lt;script&gt;alert('This is a bad script, data');&lt;/script&gt; to the server.
But if the user opens the browser developer tools (using Inspect Element) and inserts the script element directly into the editor's markup, that is when all the trouble starts. When the content is retrieved from the database and displayed in the browser, it presents an alert box.
So far I have tried many different things; one of them is:
Encode the contents using AntiXssEncoder [HttpUtility.HtmlEncode(Contents)], store that in the database, and when displaying it in the browser, decode it and render it using MvcHtmlString.Create [MvcHtmlString.Create(HttpUtility.HtmlDecode(Contents))] or Html.Raw [Html.Raw(Contents)]. As you may expect, both of them display the JavaScript alert.
I don't want to strip out <script> manually in code, as that is not a comprehensive solution (search for "And the encoded state:").
So far I have referred to many articles (sorry for not listing them all here; the few below are just proof that I put in sincere effort before writing this question), but none of them shows code that answers this. Maybe there is an easy answer and I am not looking in the right direction, or maybe it is not that simple at all and I need to use something like a Content Security Policy.
ASP.Net MVC Html.Raw with AntiXSS protection
Is there a risk in using @Html.Raw?
http://blog.simontimms.com/2013/01/21/content-security-policy-for-asp-net-mvc/
http://blog.michaelckennedy.net/2012/10/15/understanding-text-encoding-in-asp-net-mvc/
To reproduce what I am saying, go to this URL* and type <script>alert('This is a bad script, data');</script> in the text box, then click the button.
*This link is from Michael Kennedy's blog

It isn't easy, and you probably don't want to do this. May I suggest you use a simpler language than HTML for end-user formatted input? What about Markdown, which (I believe) is used by Stack Overflow? Or one of the existing wiki or other lightweight markup languages?
If you do allow Html, I would suggest the following:
only support a fixed subset of Html
after the user submits content, parse the Html and filter it against a whitelist of allowed tags and attributes.
be ruthless in filtering and eliminating anything that you aren't sure about.
There are existing tools and libraries that do this. I haven't used it, but I did stumble on http://htmlpurifier.org/. I assume there are many others. Rick Strahl has posted one example for .NET, but I'm not sure if it is complete.
About ten years ago I attempted to write my own whitelist filter. It parsed and normalized the entered Html. Then it removed any elements or attributes that were not on the allowed whitelist. It worked pretty well, but you never know what vulnerabilities you've missed. That project is long dead, but if I had to do it over I would have used an existing simpler markup language rather than Html.
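To make the idea concrete, here is a minimal sketch of such a whitelist pass in browser JavaScript. The allowed tag and attribute lists are hypothetical and deliberately tiny, and, as the paragraph above warns, a hand-rolled filter like this is easy to get subtly wrong, so prefer a maintained library.
// Minimal whitelist-filter sketch (illustrative only, not production-ready).
// ALLOWED_TAGS and ALLOWED_ATTRS are hypothetical, deliberately tiny lists.
var ALLOWED_TAGS = { P: true, B: true, I: true, EM: true, STRONG: true, A: true, UL: true, OL: true, LI: true };
var ALLOWED_ATTRS = { A: ['href', 'title'] };

function whitelistFilter(dirtyHtml) {
    var doc = new DOMParser().parseFromString(dirtyHtml, 'text/html');
    var elements = doc.body.querySelectorAll('*');
    for (var i = elements.length - 1; i >= 0; i--) {
        var el = elements[i];
        if (!ALLOWED_TAGS[el.tagName]) {
            el.remove();                        // be ruthless: drop the whole element
            continue;
        }
        var allowed = ALLOWED_ATTRS[el.tagName] || [];
        for (var j = el.attributes.length - 1; j >= 0; j--) {
            var attr = el.attributes[j];
            if (allowed.indexOf(attr.name) === -1 ||
                /^\s*javascript:/i.test(attr.value)) {
                el.removeAttribute(attr.name);  // strip event handlers, style, javascript: URLs
            }
        }
    }
    return doc.body.innerHTML;
}
Even this much misses plenty (CSS, data: URLs, SVG, parser quirks), which is exactly why the existing libraries are the safer choice.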
There are so many ways for users to inject nasty stuff into your pages, you have to be fierce to prevent this. Even CSS can be used to inject executable expressions into your page, like:
<STYLE type="text/css">BODY{background:url("javascript:alert('XSS')")}</STYLE>
Here is a page with a list of known attacks that will keep you up at night. If you can't filter and prevent all of these, you aren't ready for untrusted users to post formatted content viewable by the public.
Right around the time I was working on my own filter, MySpace (wow, I'm old) was hit by an XSS worm known as Samy. Samy used style attributes with an embedded background URL that carried a JavaScript payload. It is all explained by the author.
Note that your example page says:
This page is meant to accept and display raw HTML by trusted editors.
The key issue here is trust. If all of your users are trusted (say employees of a web site), then the risk here is lower. However, if you are building a forum or social network or dating site or anything that allows untrusted users to enter formatted content that will be viewable by others, you have a difficult job to sanitize Html.

I managed to resolve this issue using the HtmlSanitizer package from NuGet:
https://github.com/mganss/HtmlSanitizer
as recommended by the OWASP Foundation (as good a recommendation as I need):
https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.236_-_Sanitize_HTML_Markup_with_a_Library_Designed_for_the_Job
First, add the NuGet Package:
> Install-Package HtmlSanitizer
Then I created an extension method to simplify things:
using Ganss.XSS;
...
// Runs the input through HtmlSanitizer's whitelist and returns the cleaned markup.
public static string RemoveHtmlXss(this string htmlIn, string baseUrl = null)
{
    if (htmlIn == null) return null;

    var sanitizer = new HtmlSanitizer();
    return sanitizer.Sanitize(htmlIn, baseUrl);
}
I then validate within the controller when the HTML is posted:
var cleanHtml = model.DodgyHtml.RemoveHtmlXss();
AND for completeness, sanitise whenever you present it to the page, especially when using Html.Raw():
<div>@Html.Raw(Model.NotSoSureHtml.RemoveHtmlXss())</div>

Related

Making a website without hyperlinks

I am making a simple CMS to use in my own blog and I have been using the following code to display articles.
xmlhttp.onreadystatechange = function () {
    if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
        // Write the returned article HTML into the main viewing area
        document.getElementById("maincontent").innerHTML = xmlhttp.responseText;
    }
};
What it does is send a request to the server, get the content associated with the article that was clicked, and write it into the main viewing area with .innerHTML.
Thus I don't have actual links to other articles. I know that I can use PHP to output HTML so that it forms a link like :
<a href="getcontent.php?q=article+title">Article Title</a>
But being slightly OCD I wanted my output to be as neat as possible. Although search engine visibility is not a concern for my personal blog I intend to adapt this to a few other sites which have search engine optimization as a priority.
From what I understand, basically search engine robots follow links to index the web sites.
My question is:
Does this practice have any negative implications for search engine visibility? Also, are there other reasons to prefer one approach over the other? I see that almost every site uses the 'link' method.
The link you've written will cause a page reload. In order to leverage the standard AJAX stuff you've got at the top, you need to write the links as something along the lines of:
<a href="#" onclick="ajaxGet('article-title'); return false;">Article Title</a>
This assumes you have a javascript function called ajaxGet that takes an argument of the identifier for the article you're searching for.
If you were to write your entire site that way, search engines wouldn't be able to crawl you at all since they don't execute javascript. Therefore they can't get to anything off the front page. Also, even if they could follow the links, they'd have no way of referencing the page they got to since it doesn't have a unique URL. This is also annoying for users, since they can't get a link to an exact story to bookmark, send to a friend etc.
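One common compromise, sketched below under the assumption that every article also exists at a normal server URL such as getcontent.php?q=article-title, is to emit real links for crawlers and no-JavaScript visitors and enhance them unobtrusively; here ajaxGet (the hypothetical loader mentioned above) is assumed to accept the article URL.
// Markup: <a class="article-link" href="getcontent.php?q=article-title">Article Title</a>
// Progressive-enhancement sketch: real hrefs for crawlers and no-JS users,
// AJAX loading for everyone else.
document.addEventListener('click', function (event) {
    var link = event.target.closest('a.article-link');
    if (!link) return;
    event.preventDefault();                    // stop the full page load
    ajaxGet(link.getAttribute('href'));        // load the article into #maincontent instead
});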

Why do some websites have "#!" in the URL [duplicate]

I've just noticed that the long, convoluted Facebook URLs that we're used to now look like this:
http://www.facebook.com/example.profile#!/pages/Another-Page/123456789012345
As far as I can recall, earlier this year it was just a normal URL-fragment-like string (starting with #), without the exclamation mark. But now it's a shebang or hashbang (#!), which I've previously only seen in shell scripts and Perl scripts.
The new Twitter URLs now also feature the #! symbols. A Twitter profile URL, for example, now looks like this:
http://twitter.com/#!/BoltClock
Does #! now play some special role in URLs, like for a certain Ajax framework or something since the new Facebook and Twitter interfaces are now largely Ajaxified?
Would using this in my URLs benefit my Web application in any way?
This technique is now deprecated.
This used to tell Google how to index the page.
https://developers.google.com/webmasters/ajax-crawling/
This technique has mostly been supplanted by the ability to use the JavaScript History API that was introduced alongside HTML5. For a URL like www.example.com/ajax.html#!key=value, Google will check the URL www.example.com/ajax.html?_escaped_fragment_=key=value to fetch a non-AJAX version of the contents.
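As a rough sketch of that History API approach (the /articles/42-style URL, the state object, and loadArticleIntoPage are made-up names for illustration):
// Sketch of the HTML5 History API technique that replaced #! URLs.
function showArticle(id) {
    loadArticleIntoPage(id);                                      // your own AJAX rendering code
    history.pushState({ articleId: id }, '', '/articles/' + id);  // clean URL, no page reload
}

// Restore the right content when the user presses Back or Forward.
window.addEventListener('popstate', function (event) {
    if (event.state && event.state.articleId) {
        loadArticleIntoPage(event.state.articleId);
    }
});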
The octothorpe/number-sign/hashmark has a special significance in a URL: it normally identifies the name of a section of a document. The precise term is that the text following the hash is the anchor portion of the URL. If you use Wikipedia, you will see that most pages have a table of contents and you can jump to sections within the document with an anchor, such as:
https://en.wikipedia.org/wiki/Alan_Turing#Early_computers_and_the_Turing_test
https://en.wikipedia.org/wiki/Alan_Turing identifies the page and Early_computers_and_the_Turing_test is the anchor. The reason that Facebook and other Javascript-driven applications (like my own Wood & Stones) use anchors is that they want to make pages bookmarkable (as suggested by a comment on that answer) or support the back button without reloading the entire page from the server.
In order to support bookmarking and the back button, you need to change the URL. However, if you change the page portion (with something like window.location = 'http://raganwald.com';), whether to a different URL or to the same URL without an anchor, the browser will load the entire page from that URL. Try this in Firebug or Safari's JavaScript console. Load http://minimal-github.gilesb.com/raganwald. Now in the JavaScript console, type:
window.location = 'http://minimal-github.gilesb.com/raganwald';
You will see the page refresh from the server. Now type:
window.location = 'http://minimal-github.gilesb.com/raganwald#try_this';
Aha! No page refresh! Type:
window.location = 'http://minimal-github.gilesb.com/raganwald#and_this';
Still no refresh. Use the back button to see that these URLs are in the browser history. The browser notices that we are on the same page but just changing the anchor, so it doesn't reload. Thanks to this behaviour, we can have a single Javascript application that appears to the browser to be on one 'page' but to have many bookmarkable sections that respect the back button. The application must change the anchor when a user enters different 'states', and likewise if a user uses the back button or a bookmark or a link to load the application with an anchor included, the application must restore the appropriate state.
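A minimal sketch of that anchor-driven pattern (the state names and the renderState function are hypothetical):
// Anchor-based state management sketch.
function restoreState() {
    var state = window.location.hash.replace(/^#/, '') || 'home';
    renderState(state);                  // your own code that shows the right "page"
}

function goTo(state) {
    window.location.hash = state;        // adds a history entry without reloading
}

window.addEventListener('hashchange', restoreState);  // fires on back/forward, bookmarks, links
restoreState();                                       // honour an anchor present on initial load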
So there you have it: Anchors provide Javascript programmers with a mechanism for making bookmarkable, indexable, and back-button-friendly applications. This technique has a name: It is a Single Page Interface.
p.s. There is a fourth benefit to this technique: Loading page content through AJAX and then injecting it into the current DOM can be much faster than loading a new page. In addition to the speed increase, further tricks like loading certain portions in the background can be performed under the programmer's control.
p.p.s. Given all of that, the 'bang' or exclamation mark is a further hint to Google's web crawler that the exact same page can be loaded from the server at a slightly different URL. See Ajax Crawling. Another technique is to make each link point to a server-accessible URL and then use unobtrusive Javascript to change it into an SPI with an anchor.
Here's the key link again: The Single Page Interface Manifesto
First of all: I'm the author of The Single Page Interface Manifesto cited by raganwald.
As raganwald has explained very well, the most important aspect of the Single Page Interface (SPI) approach used by Facebook and Twitter is the use of the hash (#) in URLs.
The ! character is added only for Google's benefit; this notation is a Google "standard" for crawling AJAX-intensive web sites (in the extreme, Single Page Interface web sites). When Google's crawler finds a URL with #!, it knows that an alternative conventional URL exists that provides the same page "state", in that case rendered at load time.
Although the #! combination is very interesting for SEO, it is only supported by Google (as far as I know); with some JavaScript tricks you can build SPI web sites that are SEO-compatible with any web crawler (Yahoo, Bing, ...).
The SPI Manifesto and demos do not use Google's format of ! in hashes; this notation could easily be added, and SPI crawling could be even easier (UPDATE: the ! notation is now used and remains compatible with other search engines).
Take a look at this tutorial; it is an example of a simple ItsNat SPI site, but you can pick up some ideas for other frameworks. This example is SEO-compatible with any web crawler.
The hard problem is generating any (or selected) "AJAX page state" as plain HTML for SEO. In ItsNat this is easy and automatic: the same site is at the same time SPI-based and page-based for SEO (or for when JavaScript is disabled, for accessibility). With other web frameworks you can always follow the double-site approach: one site is SPI-based and another is page-based for SEO; for instance, Twitter uses this "double site" technique.
I would be very careful if you are considering adopting this hashbang convention.
Once you hashbang, you can’t go back. This is probably the stickiest issue. Ben’s post put forward the point that when pushState is more widely adopted then we can leave hashbangs behind and return to traditional URLs. Well, fact is, you can’t. Earlier I stated that URLs are forever, they get indexed and archived and generally kept around. To add to that, cool URLs don’t change. We don’t want to disconnect ourselves from all the valuable links to our content. If you’ve implemented hashbang URLs at any point then want to change them without breaking links the only way you can do it is by running some JavaScript on the root document of your domain. Forever. It’s in no way temporary, you are stuck with it.
You really want to use pushState instead of hashbangs, because making your URLs ugly and possibly broken -- forever -- is a colossal and permanent downside to hashbangs.
As a good follow-up to all this, Twitter - one of the pioneers of hashbang URLs and the single-page interface - admitted that the hashbang system was slow in the long run and that they have actually started reversing the decision and returning to old-school links.
The article about this is here.
I always assumed the ! just indicated that the hash fragment that followed corresponded to a URL, with ! taking the place of the site root or domain. It could be anything, in theory, but it seems the Google AJAX Crawling API likes it this way.
The hash, of course, just indicates that no real page reload is occurring, so yes, it’s for AJAX purposes. Edit: Raganwald does a lovely job explaining this in more detail.

Rails Sanitize: Safety + Allowing Embeds

We're building a user-generated content site where we want to allow users to embed things like videos, slideshares, etc. Can anyone recommend a generally accepted list of tags / attributes to allow in Rails sanitize that will give us pretty good security, while still allowing a good amount of the embeddable content / HTML formatting?
As long as <script> is turned off, you should be able to allow objects. You might even be able to define the actual acceptable parameters of object tags, so that you only allow a whitelist and arbitrary objects cannot be included.
However, it may be better to provide some UI support for embedding. For example, I prompt the user for a YouTube URL and then derive the embed code for the video from that (a sketch of this follows the list below).
Several benefits:
- the default YouTube code is not standards-compliant, so I can construct my own object code
- I have complete control over the way embedded elements are included in the output page
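A rough sketch of that approach (the URL pattern and iframe markup here are illustrative only; the real YouTube URL formats and recommended embed markup may differ):
// Sketch: derive embed markup we control from a user-supplied YouTube URL.
function youtubeEmbedFor(url) {
    var match = url.match(/(?:youtube\.com\/watch\?v=|youtu\.be\/)([\w-]{11})/);
    if (!match) return null;             // not a recognisable YouTube URL
    var videoId = match[1];
    // Build the markup ourselves instead of trusting pasted embed code.
    return '<iframe width="560" height="315" frameborder="0" allowfullscreen ' +
           'src="https://www.youtube.com/embed/' + videoId + '"></iframe>';
}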
Honestly, allowing users to use WYSIWYG HTML editors might sound good, but in practice it just doesn't work well for either users or developers. The reasons are:
Behavior still differs too much between browsers.
A whitelist lets you secure the site, but users will end up calling and asking you to allow yet another parameter on the OBJECT tag or similar. Blacklists are just not secure.
Not many users know what an HTML tag is.
For users it is hard to format text (how do you tell them to use a heading instead of bold + font-size?).
Generally it is pretty painful, and you cannot really change the site design later because users did not use HTML properly.
If I were building a CMS-like system now, I would probably go with semantic markup.
Most users get used to it quickly, and it is just plain text (as here on SO).
Also, YOU can generate proper HTML and support just the tags you need.
For example, if you need to embed a picture you might write something like:
My Face:image-http://here.there/bla.gif
Which would generate HTML for you like this:
<a class='image-link' title='My Face' href='http://here.there/bla.gif'>
<img alt='My Face' src='http://here.there/bla.gif' />
</a>
There are plenty of markup languages around, so just pick the one that is most appropriate for you and add your own modifications.
For example, GitHub uses a modified Markdown, and the code to parse it is just a couple of lines.
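For instance, a sketch of a parser for the made-up image markup above really is only a few lines (escaping is omitted here; real code should HTML-escape the title and URL):
// Sketch: turn "My Face:image-http://here.there/bla.gif" style markup into HTML.
function renderImageMarkup(text) {
    return text.replace(/([^:\n]+):image-(\S+)/g, function (whole, title, url) {
        return '<a class="image-link" title="' + title + '" href="' + url + '">' +
               '<img alt="' + title + '" src="' + url + '" />' +
               '</a>';
    });
}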
One disadvantage is that users need to learn the language and it is NOT WYSIWYG.
There's a great project for this. It even has embed analysis so that you can, for example, allow only YouTube embeds:
https://github.com/rgrove/sanitize

Why would I put ?src= in a link?

I feel dumb for not knowing this, but I see a lot of links in web pages and instead of this:
<a href="http://foo.com/">
...they use this:
<a href="http://foo.com/?src=bar.com">
Now I understand that the ?src= is telling the destination site that this referral is coming from bar.com, but I don't understand why this needs to be called out explicitly. Can anyone shed some light on it for me? Is this something I need to include in my program-generated links?
EDIT: Ok, sorry, I'm not being clear enough. I understand the GET syntax with a question mark and parameters separated by ampersands. I'm wondering: what is this special src parameter? Why would one site link to another and tack an src parameter onto the end when there's no indication that the destination site normally uses it?
For example, on this page hover your mouse over the screenshot. The link URL is http://moms4mom.com/?src=stackexchangesites
But moms4mom.com is our site. Passing the src parameter does nothing, so why include it?
There are a few reasons the src is being used explicitly. But in general, it is easier and more reliable to trust a query string to determine the referrer than it is to trust the Referer header, since the latter is often broken, deliberately or not. On the other hand, browsers almost never break the query string in a URL, since this, unlike the Referer header, is pretty important for pages to function. Besides, a Referer header is often sent without any deliberate action on the part of the site doing the referring, which some users dislike.
The reason (why I do it, at least) is that popular analytics tools sometimes make it easier to filter on query strings than on referrers.
There is no standard for the src parameter. Each site has its own, and it's usually up to the site that receives the link to define how it wants to read it (as usually it's that site that's going to pay for the click).
The second is a dynamic link: it's a URL that server-side code (like ASP or PHP) interprets as something to do, as in those Google URLs. But I have never used this site (foo.com), so I don't know much about this particular parameter.
Depending on how the site processes its URL, you may or may not need to include the ?... information.
This is passed to the website, and the server can process it just like form input. Some sites require this - and build their navigation off a single page, using nothing but the "extra" stuff passed afterwards. If you're generating a link to a site like that, it will be required.
In other cases, this is just used to pass extra, unrequired info (such as advertising, tracking info, etc)... In those cases, you can leave it off.
Unfortunately, there's no way to know without trying whether you can remove the "extra" bits from the URL.
After reading some of your comments - I'll also say:
There is nothing special about the "src" field in a query string. The server is free to use it any way it wishes. Unless you know specific info about the server, you cannot assume it can be left out.
The part after the ? is the query string. Different sites use it for different things; it is usually used to pass information to the server-side code for that URL, but it can also be read in JavaScript.
For more info see Query String
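For example, here is a minimal sketch of reading such a parameter on the client side, using the standard URLSearchParams browser API (the src name mirrors the question above):
// Sketch: read the "src" parameter from the current page's query string.
var params = new URLSearchParams(window.location.search);
var source = params.get('src');          // e.g. "bar.com", or null if absent
if (source) {
    console.log('This visit was tagged as coming from: ' + source);
}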
