Is it possible to use the render method of a controller to render the content of a Mechanize object? I tried:
def new
a = Mechanize.new
a.get('http://flickr.com/')
render :html => a.current_page
end
which throws an error, as do render :text => a, render :text => a.page, and render :text => a.current_page.
I understand that the render function is not expecting a Mechanize object; I just don't know what it wants or how to get it there.
I am at the beginning stages of development and am researching web scraping frameworks for Ruby, so any help would be appreciated.
Try the body method:
page = agent.get('http://www.example.net')
puts page.body[0..100]
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml"
You can also dive deeper into the document using Nokogiri's capabilities. Mechanize is built around Nokogiri, so you can get to the parsed document Nokogiri creates, then use CSS or XPath accessors to locate sub-sections of the document. Once you've found what you want, you can use the to_html method to have Nokogiri emit the HTML for the node or node set. See "extract single string from html using ruby/mechanize (and nokogiri)" for more information.
Now, while that'll work, you might want to consider whether you're violating the terms-of-service or copyrights by reusing the content directly on your page.
Specifically, I would like to import the first block of text before the table of contents from a Wikipedia page (which is public domain).
Let's say I have a model "Resource" with an attribute x, and x is a string that is a Wikipedia link (e.g. x: "http://en.wikipedia.org/wiki/Lanny_McDonald"). The first block of text on every Wikipedia page is the group of <p>...</p>'s before <div id="toc" class="toc">...</div>.
Can I write code that copies the content of these <p>...</p>'s and writes it onto my website?
This is known as Web Scraping.
Ironically, follow this Wikipedia link and consider the legal ramifications, etc.
Nokogiri is boss for this.
Install:
sudo gem install nokogiri -- --with-xml2-include=/usr/local/include/libxml2 --with-xml2-lib=/usr/local/lib
Usage:
There are methods to search using xpath or css which makes things simple.
# wiki_scraper.rb
require 'open-uri'
require 'nokogiri'
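# Load in the url (URI.open comes from open-uri; plain open no longer accepts URLs in Ruby 3+).
@doc = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/Branch_predictor"))
# Print the first <p> element. Wikipedia nests its paragraphs inside <div>s,
# so search the whole document instead of only direct children of <body>.
puts @doc.at_xpath("//p")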
You could use an HttpWebRequest to retrieve the entire page and then parse the HTML. There are tools available to convert HTML to XHTML, at which point you could use XML libraries to parse the XHTML.
I added an upload form so people can upload HTML files to my site. How do I parse a file of HTML to create a page of content on the site? Currently, I just need to get the title and body of a file, so I thought a full-blown parser like Nokogiri would be overkill.
#this takes in a <ActionDispatch::Http::UploadedFile>
def import(file)
#code to get title and body?
end
One of many ways to do this..
You can open and read the file from your controller, assuming you saved it to an object somewhere.
@content = File.read(@your_saved_object.attachment.file_name)
And then in your view (in Haml):
#content-container= @content
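If only the title and body are needed, a pair of regular expressions may be enough. This is a minimal sketch, not a robust parser: extract_title_and_body is a hypothetical helper, and it assumes reasonably well-formed input with a single title and body element. With an uploaded file you would pass it the result of file.read.

```ruby
# Pulls the <title> text and the inner HTML of <body> out of an HTML string.
# Either value is nil when the corresponding element is missing.
def extract_title_and_body(html)
  title = html[%r{<title[^>]*>(.*?)</title>}mi, 1]
  body  = html[%r{<body[^>]*>(.*?)</body>}mi, 1]
  { title: title && title.strip, body: body && body.strip }
end

# Made-up document standing in for the uploaded file's contents.
sample = <<~HTML
  <html>
    <head><title> My Page </title></head>
    <body class="main">
      <h1>Hello</h1>
    </body>
  </html>
HTML

parts = extract_title_and_body(sample)
puts parts[:title]
puts parts[:body]
```

Regular expressions break easily on malformed or unusual HTML, so a real parser such as Nokogiri is the safer choice once the uploads stop being predictable.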
I'm using the Sanitize gem to disallow HTML code that could be used for an XSS attack. As a side effect, the HTML also gets cleaned up. Missing closing tags get added. This would normally be fine but in many cases it changes the formatting of the content.
Ultimately, I would like to clean up the HTML entirely, but I don't want to have to do this as part of securing the site against XSS.
So, are missing end tags (e.g. </font>) a potential XSS exploit? If not, how do I stop Sanitize from trying to clean up the HTML too?
Sanitize is built on top of Nokogiri:
Because it’s based on Nokogiri, a full-fledged HTML parser, rather than a bunch of fragile regular expressions, Sanitize has no trouble dealing with malformed or maliciously-formed HTML, and will always output valid HTML or XHTML.
Emphasis mine. So the answer is "no", you have to fix your broken HTML.
Nokogiri has to fix the HTML so that it can be properly interpreted and a DOM can be built, then Sanitize will modify the DOM that Nokogiri builds, and finally that modified DOM will be serialized to get the HTML that you get to store.
If you scan through the Sanitize source, you'll see that everything ends up going through clean! and that will use Nokogiri's to_html or to_xhtml methods:
if @config[:output] == :xhtml
  output_method = fragment.method(:to_xhtml)
  output_method_params[:save_with] = Nokogiri::XML::Node::SaveOptions::AS_XHTML
elsif @config[:output] == :html
  output_method = fragment.method(:to_html)
else
  raise Error, "unsupported output format: #{@config[:output]}"
end
result = output_method.call(output_method_params)
So you get Nokogiri's version of the HTML, not simply your HTML with the bad parts removed.
Perhaps you can configure sanitize as demonstrated in the documentation:
By default, Sanitize removes all HTML. You can use one of the built-in
configs to tell Sanitize to allow certain attributes and elements:
Sanitize.clean(html, Sanitize::Config::RESTRICTED)
# => '<b>foo</b>'
Sanitize.clean(html, Sanitize::Config::BASIC)
# => '<b>foo</b>'
Sanitize.clean(html, Sanitize::Config::RELAXED)
# => '<b>foo</b><img src="http://foo.com/bar.jpg" />'
Or, if you’d like more control over what’s allowed, you can provide your own custom configuration:
Sanitize.clean(html, :elements => ['a', 'span'],
  :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
  :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
Quoted from wonko.com
I’m looking for a solution to send DRY multipart emails in Rails. With DRY I mean that the content for the mail is only defined once.
I’ve thought about some possible solutions but haven’t found any existing implementations.
The solutions I’ve thought about are:
1. Load the text from I18n and apply Markdown for the HTML mail, and apply Markdown with a special output type for the text mail where
   - links are put in parentheses after the link text
   - bold, italic and other formatting that doesn't make sense is removed
   - ordered and unordered lists are maintained
2. Generate only the HTML mail and convert that to text according to the above conditions
Is there any available solution out there? Which one is probably the better way to do it?
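To make the text-mail conditions concrete, here is a minimal sketch of such a conversion using plain regular expressions. markdown_to_text is a hypothetical helper, not part of any gem, and it only handles inline links, bold, and italic; list markers are simply left untouched.

```ruby
# Converts a small subset of Markdown to plain text for the text part of a mail.
def markdown_to_text(markdown)
  markdown
    .gsub(/\[([^\]]+)\]\(([^)]+)\)/, '\1 (\2)')   # [text](url) -> text (url)
    .gsub(/(\*\*|__)(.+?)\1/, '\2')               # **bold** or __bold__ -> bold
    .gsub(/(?<![\w*])([*_])(?!\s)(.+?)\1/, '\2')  # *italic* or _italic_ -> italic
end

# Made-up mail copy for illustration.
source = "Hello **World**, visit [our site](http://example.com) *today*.\n\n* one\n* two"
puts markdown_to_text(source)
```

The italic pattern refuses to start on a marker followed by whitespace, which is what keeps "* one" style list items intact. A real Markdown library with a custom text renderer would cover far more syntax than this.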
In Chapter 4 of Crafting Rails Applications, José Valim walks you through how to make a "merb" handler that uses Markdown with interspersed ERB and can compile to text and HTML. Then you make a mailer generator that generates a single merb template for each of your mail actions.
You can read an excerpt from that chapter on the page I linked you to. I highly recommend buying the book.
If you're interested in using my sorry version of what he describes in that book, you can slap this in your Gemfile:
gem 'handlers', :git => "git://github.com/chadoh/handlers.git"
Be warned that I barely know what I'm doing, that I'm not versioning that gem, and that I probably won't really even maintain it. Frankly, I wish I could find someone else who was doing a better job, but I've been unsuccessful in doing so. If you want to fork my project and be the person doing that better job, go for it!
This is a PITA, but is the only way to DRY mail such that you can support both HTML (multipart) & plaintext:
Put the html email copy in a partial file in your ActionMailer view directory with the following extension: _action.html.erb
Replace "action" with whatever action name you are using.
Then create 2 more files in the same directory:
action.text.html.erb and
action.text.plain.erb
In the text.html partial:
<%= render "action.html", :locals => {:html => true} %>
In the text.plain partial:
<% content = render "action.html", :locals => {:html => false} %>
<%= strip_tags(content) %>
That works for me, though it certainly makes me want to pay the monthly service for madmimi
Use the maildown gem.
This gem does the heavy lifting of allowing you to use email.md.erb instead of email.html.erb and email.text.erb. Write it once in a sane format and have it automatically display in HTML and in Plain Text. Win.
There are some intricacies here that you'll want to look at based on your use-case, but here's some of what we did to get it working well:
Create a maildown.rb initializer to setup some sane defaults:
Maildown.allow_indentation = true # Prevents code blocks from forming when using indentation in markdown emails.
Maildown::MarkdownEngine.set_text do |text|
  text.gsub( /{:.*}\n?/, "" ) # Removes Kramdown annotations that apply classes, etc. with `{: .class }`.
end
This allows you to use indents in your blocks, etc. But also precludes the ability to add indents in your Plain Text. It also removes Kramdown-specific annotation from Plain Text.
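To see what that substitution does on its own, here is a small sketch in plain Ruby. strip_kramdown_annotations is a hypothetical wrapper around the same regular expression, and the sample text is made up.

```ruby
# Removes Kramdown attribute lists such as `{: .heading }`,
# together with the line break that follows them, if any.
def strip_kramdown_annotations(text)
  text.gsub(/{:.*}\n?/, "")
end

# Made-up email copy with a block-level annotation on its own line.
sample = "# Welcome\n{: .heading }\nThanks for signing up.\n"
puts strip_kramdown_annotations(sample)
```

Because of the optional \n in the pattern, an annotation sitting at the end of a line of text also swallows that line break, so it is worth checking the Plain Text output for lines that got joined.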
Then just replace your HTML and Plain Text files with a single .md.erb file and test it out to make sure it looks good in both versions.
Note, until you remove the .html.erb and .text.erb files, it will show those first before looking for a .md.erb file. This may actually be a nice feature if you ever needed to write separate formats for a specific email (maybe a marketing one that requires more complex formatting than Markdown can provide) without having to specify anything anywhere.
Works a treat.
I am working on a simple Rails/jQuery HTML templater app that stores a series of pre-designed templates in a database (at the moment I've just saved these as partials to get the basic concept working). On clicking 'Show code' alongside any one of these template records, a js.erb script should dynamically place the corresponding partial within 'pre' tags on that page so the user can see the raw HTML code.
At the moment it's working but I get the rendered html coming back and not the raw HTML that I'm looking for. Here's the js:
$("div#template-view").html("<pre><code><%= escape_javascript( render :partial => "core_template") %></code></pre>");
So pray tell, what obvious thing am I missing!? :-)
Thanks
Allan
Use
$("div#template-view").text("...")
instead. This inserts the string as plain text, so the browser will not parse it as HTML.
The pre tag will show source code (or any text) in a reasonable approximation of its original state, but it won't escape HTML for you. Unescaped HTML will always be rendered as HTML regardless of what tag it happens to be in. By escaped I mean that all the special characters are converted to their escaped versions. The Rails method h will do this for you, so if you call h on the result of escape_javascript then it should work fine.
$("div#template-view").html("<pre><code><%= h(escape_javascript(render :partial => "core_template")) %></code></pre>");
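For a feel of what escaping means here, this sketch uses ERB::Util.html_escape from Ruby's standard library, which is also available under the short name h. The escape_for_pre wrapper and the sample markup are made up for illustration.

```ruby
require 'erb'

# Converts the HTML special characters into entities so a browser
# displays the markup as text instead of rendering it.
def escape_for_pre(html)
  ERB::Util.html_escape(html)
end

puts escape_for_pre('<p class="note">Tom & Jerry</p>')
# prints: &lt;p class=&quot;note&quot;&gt;Tom &amp; Jerry&lt;/p&gt;
```

Once the angle brackets are entities, the pre and code tags show the template source rather than the rendered result.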