Rails Sanitize: Safety + Allowing Embeds - ruby-on-rails

We're building a user-generated content site where we want to allow users to embed things like videos, SlideShares, etc. Can anyone recommend a generally accepted list of tags/attributes to allow in Rails sanitize that will give us pretty good security while still allowing a good amount of embeddable content and HTML formatting?

As long as scripting is turned off, you should be able to allow object tags. You might even be able to define the actual acceptable parameters of object tags, so that you only allow a whitelist and arbitrary objects cannot be included.
However, it may be better to provide some UI support for embedding. For example, I prompt the user for a YouTube URL and then derive the embed code for the video from that.
Several benefits:
- the default YouTube code is not standards-compliant, so I can construct my own object markup
- I have complete control over the way embedded elements are included in the output page
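For illustration, here is a rough Ruby sketch of both ideas. The tag/attribute lists and the embed_youtube helper are assumptions for this example (not a vetted allowlist), and the embed is shown as an iframe rather than an object element:

# 1) Rails' sanitize helper with an explicit allowlist (called from a view or helper):
ALLOWED_TAGS  = %w[p br strong em b i u ul ol li blockquote a img h2 h3 object param embed]
ALLOWED_ATTRS = %w[href src alt title width height type data value]

def clean_user_html(html)
  sanitize(html, tags: ALLOWED_TAGS, attributes: ALLOWED_ATTRS)
end

# 2) UI-driven embedding: prompt for a YouTube URL and build the markup yourself,
#    so no user-supplied object/iframe code is ever stored.
def embed_youtube(url)
  video_id = url[/(?:v=|youtu\.be\/)([\w-]{11})/, 1]
  return nil unless video_id

  %(<iframe width="560" height="315" src="https://www.youtube.com/embed/#{video_id}" frameborder="0" allowfullscreen></iframe>)
end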

Honestly, allowing users to use WYSIWYG HTML editors might sound good, but in practice it just doesn't work well for either users or developers. The reasons are:
- Behavior still differs too much between browsers.
- A whitelist lets you secure the site, but users will end up calling and asking you to allow yet another parameter on the OBJECT tag or similar. Blacklists are simply not secure.
- Not many users know what an HTML tag is.
- It is hard for users to format text (how do you tell them to use a HEADER instead of BOLD + FONT-SIZE?).
- Generally it is pretty painful, and you cannot really change the site design later if needed because users did not use HTML properly.
If I were building a CMS-like system now, I would probably go with semantic markup.
Most users get used to it quickly, and it is just plain text (as here at SO).
Also, YOU generate the HTML, so you can produce proper markup and support only the tags you need.
For example, if you need to embed a picture, you might write something like:
My Face:image-http://here.there/bla.gif
Which would generate HTML for you like this:
<a class='image-link' title='My Face' href='http://here.there/bla.gif'>
<img alt='My Face' src='http://here.there/bla.gif' />
</a>
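A minimal Ruby sketch of such a rule, just to show how little code it takes (the pattern and method name are made up for this example, and a real version would HTML-escape the title and URL before interpolating them):

def expand_image_markup(text)
  text.gsub(%r{(.+?):image-(\S+)}) do
    title, url = Regexp.last_match(1), Regexp.last_match(2)
    %(<a class='image-link' title='#{title}' href='#{url}'><img alt='#{title}' src='#{url}' /></a>)
  end
end

expand_image_markup("My Face:image-http://here.there/bla.gif")
# => "<a class='image-link' title='My Face' href='http://here.there/bla.gif'><img alt='My Face' src='http://here.there/bla.gif' /></a>"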
There are plenty of markup languages around, so just pick the one that is most appropriate for you and add your own modifications.
For example, GitHub uses a modified Markdown, and the code to parse it is just a couple of lines.
One disadvantage is that users need to learn the language and it is NOT WYSIWYG.
Regards,
Dmitriy.

There's a great project for this. It even has embed analysis so that you can, for example, allow only YouTube embeds:
https://github.com/rgrove/sanitize
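A sketch of how that looks with the gem, roughly following the YouTube transformer example in its README (check the README for the exact API of the version you install; older releases use :node_whitelist instead of :node_allowlist, and user_html stands in for whatever your users submit):

require 'sanitize'

youtube_transformer = lambda do |env|
  node = env[:node]
  return unless env[:node_name] == 'iframe'
  return unless node['src'] =~ %r{\A(?:https?:)?//(?:www\.)?youtube(?:-nocookie)?\.com/}

  # Scrub the embed itself so nothing unexpected hides inside it,
  # then tell Sanitize this particular node is allowed.
  Sanitize.node!(node, elements: %w[iframe],
                       attributes: { 'iframe' => %w[allowfullscreen frameborder height src width] })
  { node_allowlist: [node] }
end

clean = Sanitize.fragment(user_html,
                          Sanitize::Config.merge(Sanitize::Config::RELAXED,
                                                 transformers: [youtube_transformer]))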

Related

URL structure for multilingual websites

I'm developing a SPA web app and it will support various languages. It is built with AngularJS and I am using angular-translate to provide i18n.
But I am struggling a little bit with what the URL structure should be. I do not plan on using either gTLDs or ccTLDs, so that leaves me with three options:
- Use query params: ?locale=en-us
- Use url paths: /en-us/page
- Store the chosen locale in localStorage or a cookie
The first option is a no-go according to Google's guidelines for web apps SEO. So that leaves me with the last two options.
I have a hard time deciding which is more beneficial, though I am inclined to believe that using url paths would probably be more crawler friendly.
P.S: Not sure if this is the best place to ask such a question either.
The second option is your safest bet, as according to https://webmasters.stackexchange.com/questions/59652/what-happens-if-i-try-to-set-a-cookie-on-a-bot cookies are ignored. You can test this yourself by going to the Google Search Console and fetching your website.
As of now, most crawlers ignore cookies and DO NOT execute JavaScript. This means that they usually just download the HTML and make their judgements from there.
Some developers get around the no javascript problem by pre-rendering parts of their content. I haven't done it personally but you might want to check out https://prerender.io/
Edit
As rolandjitsu mentioned, Google now crawls and executes JavaScript content.
You should go with the second option: provide the language tag (and, optionally, region subtags) in the URL path as the first segment.
For the simple reason that it allows you, visitors, and bots to link to specific translations.

Prevent XSS attacks and still use Html.Raw

I have a CMS system where I am using CKEditor to enter data. If a user types in <script>alert('This is a bad script, data');</script>, then CKEditor does a fair job and encodes it correctly, passing &lt;script&gt;alert('This is a bad script, data');&lt;/script&gt; to the server.
But if the user goes into the browser developer tools (using Inspect Element) and injects the script directly into the editor's content, that is when all the trouble starts. After the content is retrieved back from the DB and displayed in the browser, it presents the alert box.
So far I have tried many different things; one of them is:
Encode the contents using AntiXssEncoder [HttpUtility.HtmlEncode(Contents)] and then store them in the database; when displaying them back in the browser, decode and display them using MvcHtmlString.Create [MvcHtmlString.Create(HttpUtility.HtmlDecode(Contents))] or Html.Raw [Html.Raw(Contents)]. As you might expect, both of them display the JavaScript alert.
I don't want to replace the <script> tag manually through code, as that is not a comprehensive solution (search for "And the encoded state:").
So far I have referred to many articles (sorry, not listing them all here, just adding a few as proof that I put in sincere effort before writing this question), but none of them have code which shows the answer. Maybe there is an easy answer and I am not looking in the right direction, or maybe it is not that simple at all and I need to use something like Content Security Policy.
ASP.Net MVC Html.Raw with AntiXSS protection
Is there a risk in using @Html.Raw?
http://blog.simontimms.com/2013/01/21/content-security-policy-for-asp-net-mvc/
http://blog.michaelckennedy.net/2012/10/15/understanding-text-encoding-in-asp-net-mvc/
To reproduce what I am saying, go to *this url, type <script>alert('This is a bad script, data');</script> in the text box, and click the button.
*This link is from Michael Kennedy's blog
It isn't easy, and you probably don't want to do this. May I suggest you use a simpler language than HTML for end-user formatted input? What about Markdown, which (I believe) is used by Stack Overflow? Or one of the existing Wiki or other lightweight markup languages?
If you do allow Html, I would suggest the following:
- only support a fixed subset of Html
- after the user submits content, parse the Html and filter it against a whitelist of allowed tags and attributes.
- be ruthless in filtering and eliminating anything that you aren't sure about.
There are existing tools and libraries that do this. I haven't used it, but I did stumble on http://htmlpurifier.org/. I assume there are many others. Rick Strahl has posted one example for .NET, but I'm not sure if it is complete.
About ten years ago I attempted to write my own whitelist filter. It parsed and normalized the entered Html. Then it removed any elements or attributes that were not on the allowed whitelist. It worked pretty well, but you never know what vulnerabilities you've missed. That project is long dead, but if I had to do it over I would have used an existing simpler markup language rather than Html.
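To make the approach concrete, here is a stripped-down sketch of that kind of whitelist filter, written in Ruby/Nokogiri purely for illustration (the tag and attribute lists are arbitrary, and a maintained sanitizer library is still the safer choice):

require 'nokogiri'

SAFE_TAGS  = %w[p br strong em ul ol li blockquote a img]
SAFE_ATTRS = { 'a' => %w[href title], 'img' => %w[src alt] }

def whitelist_filter(html)
  doc = Nokogiri::HTML::DocumentFragment.parse(html)
  doc.css('*').each do |node|
    if SAFE_TAGS.include?(node.name)
      allowed = SAFE_ATTRS.fetch(node.name, [])
      node.attribute_nodes.each do |attr|
        # Drop attributes that aren't whitelisted, plus javascript: URLs.
        attr.remove if !allowed.include?(attr.name) || attr.value =~ /\Ajavascript:/i
      end
    elsif %w[script style].include?(node.name)
      node.remove                    # drop dangerous elements and their contents
    else
      node.replace(node.children)    # unwrap other disallowed elements, keep their text
    end
  end
  doc.to_html
end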
There are so many ways for users to inject nasty stuff into your pages, you have to be fierce to prevent this. Even CSS can be used to inject executable expressions into your page, like:
<STYLE type="text/css">BODY{background:url("javascript:alert('XSS')")}</STYLE>
Here is a page with a list of known attacks that will keep you up at night. If you can't filter and prevent all of these, you aren't ready for untrusted users to post formatted content viewable by the public.
Right around the time I was working on my own filter, MySpace (wow, I'm old) was hit by an XSS worm known as Samy. Samy used style attributes with an embedded background URL that had a JavaScript payload. It is all explained by the author.
Note that your example page says:
This page is meant to accept and display raw HTML by trusted editors.
The key issue here is trust. If all of your users are trusted (say employees of a web site), then the risk here is lower. However, if you are building a forum or social network or dating site or anything that allows untrusted users to enter formatted content that will be viewable by others, you have a difficult job to sanitize Html.
I managed to resolve this issue using the HtmlSanitizer in NuGet:
https://github.com/mganss/HtmlSanitizer
as recommended by the OWASP Foundation (as good a recommendation as I need):
https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.236_-_Sanitize_HTML_Markup_with_a_Library_Designed_for_the_Job
First, add the NuGet Package:
> Install-Package HtmlSanitizer
Then I created an extension method to simplify things:
using Ganss.XSS;
...
public static string RemoveHtmlXss(this string htmlIn, string baseUrl = null)
{
    if (htmlIn == null) return null;

    var sanitizer = new HtmlSanitizer();
    return sanitizer.Sanitize(htmlIn, baseUrl);
}
I then validate within the controller when the HTML is posted:
var cleanHtml = model.DodgyHtml.RemoveHtmlXss();
AND for completeness, sanitise whenever you present it to the page, especially when using Html.Raw():
<div>@Html.Raw(Model.NotSoSureHtml.RemoveHtmlXss())</div>

How iOS Google Now can show different card template

I wanted to know about the technology decisions behind the iOS Google app.
As we can see, the app's Google Now feature renders many different card templates for different scenarios, and those templates seem to be very flexible based on server inputs.
I was wondering if this is all implemented with HTML5, or whether they just have many templates built in and render them locally. I'd vote for the HTML5 route, but I'm not sure whether it still involves some native code to make it more responsive.
Thanks!
As we (well, most of the community) are not Google employees we can't tell you what they really did, but I'd say that it is possible to do this dynamically in the app.
We did develop something similar that responds to definitions sent by the server and transforms them to custom designed forms following basic rules.
Since Google reuses the design of those cards across different platforms, the easiest solution would be showing some WebView and using HTML5.
I agree with Kevin, as this answer is entirely based on personal opinion, too.
The way I would go is to create a card class which loads some JSON data and formats it with HTML and CSS. Looking at each card, it would be hell to format things that way natively. I mean, attributed strings are not the way to go; there is too much logic in deciding which card gets bigger text or a picture.
Additionally, the top header is most likely "localized" as well, so you get the location and load a localized image. But that is Google by nature.

Canonical url and localization

In my application I have localized urls that look something like this:
http://examle.com/en/animals/elephant
http://examle.com/nl/dieren/olifant
http://examle.com/de/tiere/elefant
This question is mainly for Facebook Likes, but I guess I will hit similar problems when I start thinking about search engine crawlers.
What kind of url would you expect as the canonical url? I don't want to use the exact English url, because I want people clicking the link to be forwarded to their own language (based on browser settings or IP).
The IP lookup is not something that I want to do on every page hit. Besides that, I would need to incorporate more 'state' into my application, because I have to check whether a user has already been forwarded to his own locale or is browsing the English version on purpose.
I guess it is going to be something like:
http://example.com/something/animals/elephant
or maybe without any language identifier at all:
http://example.com/animals/elephant
but that is a bit harder to implement, and there is a bigger chance of url clashes in the future (in the rare case that I get a category called en or de).
Summary
What kind of url would you expect as canonical url? Is there already a standard set for this?
I know this question is a bit old, but I was facing the same issue.
I found this:
Different language versions of a single page are considered duplicates only if the main content is in the same language (that is, if only the header, footer, and other non-critical text is translated, but the body remains the same, then the pages are considered to be duplicates).
That can be found here: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
From this I can conclude that we should add locales to canonicals.
I did find one resource that recommends not using the canonical tag with localized addresses. However, Google's documentation does not specify and only mentions subdomains in another context.
There is more than just language that you need to think of.
It's typically a tuple of three: {region, language, property}.
If you only have one website, then you have {region, language} only.
Every piece of content can either be different in this three-dimensional space, or at least presented differently. But it is the same piece of content, so you'd like to centralize the management of editorial signals, promotions, tracking, etc. Think about search systems: you'd like PageRank to be merged across all instances of the article, not spread thinly across them.
I think there is a standard solution: Canonical URL
Put language/region into the domain name
example.com
uk.example.com
fr.example.com
Now you have a choice of whether to attach a cookie to the subdomain (for language/region) or to the domain (for user tracking)!
On every HTML page, add a link to the canonical URL:
<link rel="canonical" href="http://example.com/awesome-article.html" />
Now you are done.
There certainly is no "standard" beyond the fact that it has to be a URL. What you certainly do see on many commercial websites is exactly what you describe:
<protocol>://<server>/<language>/<more-path>
For the "language-tag" you may follow the RFCs as well. I guess your 2-letter abbreviation is quite fine.
I only disagree on the <more-path> part of the URL. If I understand you right, you are thinking about transforming each page's path into a local-language URL? I would not do that. Maybe I am not the standard user, but I personally like to manually monkey around in URLs; i.e., if the URL shown is http://examle.com/de/tiere/elefant but I don't trust the content to be translated well, I would manually try http://examle.com/en/tiere/elefant -- and that would not bring me to the expected page. And since I also dislike URLs like http://ex.com/with-the-whole-title-in-the-url-so-the-page-will-be-keyworded-by-search-engines, my favorite would be to just exchange the <language> part and use generic English (or any other language) for <more-path>. E.g.:
http://examle.com/en/animals/elephant
http://examle.com/nl/animals/elephant
http://examle.com/de/animals/elephant
If your site is something like Wikipedia, then I would agree with your scheme of translating the <more-path> as well.
Maybe these Google guidelines can help with your issue: https://support.google.com/webmasters/answer/189077?hl=en
It says that many websites serve users (across the world) with content targeted to users in a certain region. It is advised to use the rel="alternate" hreflang="x" attributes to serve the correct language or regional URL in Search results.
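For URL schemes like those in the question, that markup would look something along these lines on each language version of the page (illustrative only; the hreflang values must match the locales you actually serve):

<link rel="alternate" hreflang="en" href="http://example.com/en/animals/elephant" />
<link rel="alternate" hreflang="nl" href="http://example.com/nl/dieren/olifant" />
<link rel="alternate" hreflang="de" href="http://example.com/de/tiere/elefant" />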

Making a website without hyperlinks

I am making a simple CMS to use in my own blog and I have been using the following code to display articles.
xmlhttp.onreadystatechange = function () {
    if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
        document.getElementById("maincontent").innerHTML = xmlhttp.responseText;
    }
};
What it does is send a request to the database, get the content associated with the article that was clicked on, and write it to the main viewing area with .innerHTML.
Thus I don't have actual links to other articles. I know that I can use PHP to output HTML so that it forms a link like:
<a href=getcontent.php?q=article+title>Article Title</a>
But being slightly OCD I wanted my output to be as neat as possible. Although search engine visibility is not a concern for my personal blog I intend to adapt this to a few other sites which have search engine optimization as a priority.
From what I understand, basically search engine robots follow links to index the web sites.
My question is:
Does this practice have any negative implications for search engine visibility? Also, are there other reasons to prefer one approach over the other, as I see that almost every site uses the 'link' method?
The link you've written will cause a page reload. In order to leverage the standard AJAX stuff you've got at the top, you need to write the links as something along the lines of
<a href="javascript:ajaxGet('article-title');">Article Title</a>
This assumes you have a javascript function called ajaxGet that takes an argument of the identifier for the article you're searching for.
If you were to write your entire site that way, search engines wouldn't be able to crawl you at all, since they don't execute JavaScript. Therefore they can't get to anything beyond the front page. Also, even if they could follow the links, they'd have no way of referencing the page they got to, since it doesn't have a unique URL. This is also annoying for users, since they can't get a link to an exact story to bookmark, send to a friend, etc.
