Nokogiri unescaped html - ruby-on-rails

I am parsing HTML text using Nokogiri and making some changes to that HTML:
doc = Nokogiri::HTML.parse(html_code)
But I am using Mustache with that HTML, so it contains Mustache variables enclosed in curly braces, e.g. {{mustache_variable}}.
After tinkering with the Nokogiri document, when I call
doc.to_html
the curly braces are escaped and I get something like %7B%7Bmustache_variable%7D%7D.
Not all of the content is escaped, though. For example, if I have this HTML:
<label> {{mustache_variable}} </label>
it returns <label> {{mustache_variable}} </label>
But for HTML like <img src='{{mustache_variable}}'>
it returns <img src='%7B%7Bmustache_variable%7D%7D'>
So I am currently using gsub to replace %7B and %7D with { and } respectively so that Mustache works.
Is there a way to get the exact HTML back from Nokogiri, or is there a better solution?

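You probably need the cgi module:
require 'cgi'
doc = Nokogiri::HTML.parse(html_code)
# CGI.unescapeHTML only decodes HTML entities such as &lt; and leaves %7B untouched.
# CGI.unescape decodes percent escapes, but it decodes all of them and turns '+' into a space.
CGI.unescape(doc.to_html)
If the problem were HTML entities rather than percent escapes, you could use the htmlentities library.
You could also try doc.content instead of doc.to_html, although that returns only the text, not the markup.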

I ran into this same problem and ended up using a regular expression to convert the escaped double braces:
html_doc.gsub(/%7B%7B(.+?)%7D%7D/, '{{\1}}')
To make this safer, I'd recommend prefixing each mustache variable with a namespace, just in case some of the HTML does have the escaped double brace pattern intentionally, e.g.
html_doc.gsub(/%7B%7Bnamespace(.+?)%7D%7D/, '{{namespace\1}}')
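A minimal sketch of the namespaced variant, wrapped in a helper so the pattern lives in one place. The method name restore_mustache and the ns_ prefix are assumptions chosen for illustration; neither comes from Nokogiri or Mustache.

```ruby
# Restore Mustache tags that were percent-encoded inside attribute values.
# `prefix` is a hypothetical namespace shared by all template variables.
def restore_mustache(html, prefix: 'ns_')
  pattern = /%7B%7B(#{Regexp.escape(prefix)}.+?)%7D%7D/
  html.gsub(pattern) { "{{#{Regexp.last_match(1)}}}" }
end

escaped = "<img src='%7B%7Bns_image_url%7D%7D' alt='%7B%7Bother%7D%7D'>"
puts restore_mustache(escaped)
# Only the namespaced tag is restored; the un-namespaced escape is left alone.
```

Run this on the string returned by doc.to_html, as the last step before handing the markup to Mustache.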

Related

Specify full path for image source in Ruby on Rails

I have HTML documents that contain image tags. I need to pick out each image tag's source attribute and replace the relative path already present with a full path, that is, prepend the absolute base URL.
Current version:
<img src = '/assets/rails.png' />
After transformation:
<img src = 'http://localhost:3000/assets/rails.png' />
What would be the cleanest and most efficient way to do this in RoR?
Addition
I am going to use the transformed HTML as a string and pass it to the IMGKit gem to turn it into an image.
It's hard to tell whether you mean HTML templates, such as HAML or ERB, or real HTML files. If you're trying to manipulate HTML files, you should use Nokogiri to parse them and change the src attributes:
require 'nokogiri'
require 'uri'
html = '<html><body><img src="/path/to/image1.jpg"><img src="/path/to/image2.jpg"></body></html>'
doc = Nokogiri.HTML(html)
doc.search('img[src]').each do |img|
  img['src'] = URI.join('http://localhost:3000', img['src']).to_s
end
puts doc.to_html
Which outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<img src="http://localhost:3000/path/to/image1.jpg"><img src="http://localhost:3000/path/to/image2.jpg">
</body></html>
You could manipulate the src attributes in various ways, but the advantage of using URI is that it is aware of the various twists and turns that URLs need to follow. Rewriting the attribute with gsub or other text manipulation requires you to handle all of that yourself, and unexpected encoding problems can creep in.
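To illustrate what URI handles for you, here is a small sketch using only the standard library. The helper name absolute_src is hypothetical, and the base URL is the one from the question.

```ruby
require 'uri'

BASE_URL = 'http://localhost:3000' # assumed host, taken from the question

# Resolve a src value against the base URL.
# A src that is already an absolute URL comes back unchanged.
def absolute_src(src, base = BASE_URL)
  URI.join(base, src).to_s
end

puts absolute_src('/assets/rails.png')
puts absolute_src('http://example.com/logo.png')
```

A plain string concatenation would have mangled the second value, which is the kind of case the URI approach covers without extra checks.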
You can create a helper method so you can use it everywhere:
def full_image_path(image)
  request.protocol + request.host_with_port + image.url
end

How to correctly state internationalization keys/values in YAML files?

I am using Ruby on Rails 3.1.0 and I would like to know how to correctly state internationalization keys/values in YAML files (I have a couple of questions/doubts...). That is, I have a locale file containing the following code:
en:
  # Note the 'style' HTML property and the ':' at the end
  test_key_html: <span style='color: #4682B4;'>Test text</span>:
How should I correctly add a colon (punctuation) to a YAML file (maybe by using HTML code)?
How should I properly state the HTML 'style' property in the YAML file? What do you advise?
Those translation files aren't meant to contain HTML. I would avoid putting the entire HTML string in there; instead, keep just the string "Test text" and move the HTML portion back into your template or helper.
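If the markup does have to stay in the locale file, quoting the whole value is one way to keep YAML from treating the colons or the # as syntax. The sketch below uses Ruby's standard yaml library to check how a quoted entry parses; the key name comes from the question and the rest is illustrative.

```ruby
require 'yaml'

# A locale entry whose value is double-quoted, so ':' and '#' stay literal.
locale_yaml = <<~LOCALE
  en:
    test_key_html: "<span style='color: #4682B4;'>Test text</span>:"
LOCALE

translations = YAML.safe_load(locale_yaml)
puts translations['en']['test_key_html']
```

The parsed value keeps the trailing colon and the style attribute exactly as written.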

Rails' strip_tags is not even close to being as good as PHP's strip_tags?

The Rails version of strip_tags() doesn't seem to remove JavaScript and CSS code blocks?
Or am I missing something?
You may want to use the Sanitize gem for this; by default it strips out everything and leaves just plain text. (In Sanitize 3.0 and later the equivalent call is Sanitize.fragment.)
The example from GitHub is:
html = '<b>foo</b><img src="http://foo.com/bar.jpg">'
Sanitize.clean(html) # => 'foo'

Rails with backbone-rails: asset helpers (image_path) in EJS files

I have a Rails 3.1 app that uses codebrew/backbone-rails. In a .jst.ejs template, I would like to include an image, like so:
<img src="<%= image_path("foo.png") %>"/>
But of course the asset helpers are not available in JavaScript.
Chaining ERB (.jst.ejs.erb) does not work, because the EJS syntax conflicts with ERB.
Here is what I know:
The asset helpers are not available in the browser, so I need to run them on the server side.
I can work around the problem by making the server dump various asset paths into the HTML (through data attributes or <script> and JSON) and reading them back in JS, but this seems rather kludgy.
Is there a way to somehow use the asset helpers in EJS files?
There is a way, actually, to chain a .jst.ejs.erb file, although it's fairly undocumented, and I only found it through looking at the EJS test cases. You can tell EJS to use {{ }} (or [% %] or whatever else you want) instead of <% %>, and then ERB won't try to evaluate your EJS calls.
Make sure to require EJS somewhere in your code (I just included gem 'ejs' in my Gemfile), and then create an initializer (I called it ejs.rb) that includes the following:
EJS.evaluation_pattern = /\{\{([\s\S]+?)\}\}/
EJS.interpolation_pattern = /\{\{=([\s\S]+?)\}\}/
Then just make sure to rename your templates to .jst.ejs.erb, and replace your existing <% %> EJS-interpreted code with {{ }}. If you want to use something other than {{ }}, change the regular expressions in the initializer.
I wish there were an option in Sprockets to handle this through the config rather than having to explicitly include EJS, but as of the moment, there's no way to do that that I know of.
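A rough sketch of why the delimiter change works, using only Ruby's standard erb library rather than the EJS gem itself. The path variable is a stand-in for a real image_path call, and the template string is made up for illustration.

```ruby
require 'erb'

# Same interpolation pattern as in the initializer, copied so this runs standalone.
interpolation_pattern = /\{\{=([\s\S]+?)\}\}/

# A .jst.ejs.erb-style source: ERB fills in the path, EJS handles the rest later.
source = '<img src="<%= path %>" alt="{{= name }}"/>'
after_erb = ERB.new(source).result_with_hash(path: '/assets/foo.png')

puts after_erb
# The {{= }} tag passes through ERB untouched and still matches the EJS pattern.
puts after_erb[interpolation_pattern, 1].strip
```

ERB only reacts to its own <% %> tags, so anything written with the custom delimiters reaches EJS unchanged.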
I can see two ways. Neither is great.
If you write <%%= variable %>, ERB renders it as <%= variable %>, so you could double-percent-escape everything except the asset helper calls, and those tags would survive one ERB pass on the way to EJS.
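A quick way to see the escaping at work with the standard erb library. The path variable stands in for an asset helper call, and the template is invented for the example.

```ruby
require 'erb'

# <%%= ... %> is passed through as a literal <%= ... %>; <%= ... %> is evaluated.
template = '<li><%%= name %></li><img src="<%= path %>"/>'
rendered = ERB.new(template).result_with_hash(path: '/assets/foo.png')

puts rendered
```

The rendered string still contains <%= name %> for EJS to evaluate in the browser, while the image path has already been filled in on the server.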
If you find that too gross...
How about making a separate JavaScript file, with an .erb extension, that defines your asset paths? Then use the asset pipeline to require it.
So say assets.js.erb defines something like:
MyAssets = {
  'foo': '<%= image_path("foo.png") %>',
  ...
}
Then require this file somewhere near the top of your manifest, and reference the globals however that works in EJS.
For those willing to try HAML instead of EJS: using haml-coffee through haml_coffee_assets has worked well for me.
You can have the following in a .hamlc.erb file:
%img(src="<%= image_path('foo.png') %>")
(It still doesn't give you routing helpers though, only asset helpers.)
Ryan Fitzgerald was kind enough to post a gist of his JavaScript asset helpers (which get precompiled with ERB): https://gist.github.com/1406349
You can use the corresponding JavaScript helper via the following gem:
https://github.com/kavkaz/js_assets
Finally (after installing and configuring) you will be able to use it like this:
<img src="<%= asset_path("foo.png") %>"/>

Strip Inline CSS and JavaScript in Rails

I'm working on a Rails application and I would like to know what's the best way to strip blocks of CSS or JavaScript.
<style>
...
</style>
-or-
<script>
...
</script>
I'm using the strip_tags helper to take care of most of the HTML, but it leaves a bunch of CSS when the content contains inline CSS. Thanks
Try the Nokogiri library:
require 'nokogiri'
str = " ... " # some html from user
doc = Nokogiri::HTML(str)
doc.css("style,script").remove # remove all tags with content
new_string = doc.to_s
Nokogiri can do much more, but this is what you asked for in the question :-)
The recommended way to do this is to use the sanitize method. The strip_tags method is somewhat limited and less secure:
[strip_tags] Strips all HTML tags from the html, including comments. This uses the html-scanner tokenizer and so its HTML parsing ability is limited by that of html-scanner.
If you use sanitize, you will be much more secure; just come up with a whitelist of the tags you intend to allow first.
If you need user-provided CSS for your application, you can try using http://github.com/courtenay/css_file_sanitize/tree/master as well.
