How to replace HTML nodes using Nokogiri - ruby-on-rails

I have an HTML file, in which, all
<div class="replace-me">
</div>
must be replaced with
<video src='my_video.mov'></video>
The code is:
doc.css("div.replace-me").each do |div|
div.replace "<video src='my_video.mov'></video>"
end
It's simple but, unfortunately, it doesn't work for me. Nokogiri crashes with the following error:
undefined method `children' for nil:NilClass
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/gems/1.8/gems/activesupport-2.3.5/lib/active_support/whiny_nil.rb:52:in `method_missing'
/Library/Ruby/Gems/1.8/gems/nokogiri-1.4.2/lib/nokogiri/html/document_fragment.rb:16:in `initialize'
/Library/Ruby/Gems/1.8/gems/nokogiri-1.4.2/lib/nokogiri/xml/node.rb:424:in `new'
/Library/Ruby/Gems/1.8/gems/nokogiri-1.4.2/lib/nokogiri/xml/node.rb:424:in `fragment'
/Library/Ruby/Gems/1.8/gems/nokogiri-1.4.2/lib/nokogiri/xml/node.rb:776:in `coerce'
/Library/Ruby/Gems/1.8/gems/nokogiri-1.4.2/lib/nokogiri/xml/node.rb:331:in `replace'
Replacing with a div works:
doc.css("div.replace-me").each do |div|
div.replace "<div>Test</div>"
end
Is this a Nokogiri bug, or did I do something wrong?
The same issue occurs with add_child, inner_html and other setters for this purpose.

I will quote my comment from your previous question:
This happens because of HTML strictness (HTML has a predefined set of elements). Replace Nokogiri::HTML( self.content ) with Nokogiri::XML( self.content ) and do not forget to add a DOCTYPE declaration manually later.

If you look at the log, the node you selected with Nokogiri turns out to be nil.
Try it this way:
doc.css(".replace-me").each do |div|
div.replace "<video src='my_video.mov'></video>"
end
Or you may need to specify which element you want to replace.

I can't duplicate the problem. Granted, the question is old, but this works:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="replace-me">
</div>
EOT
doc.at('div.replace-me').replace("<video src='my_video.mov'></video>")
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
# "<html><body>\n" +
# "<video src=\"my_video.mov\"></video>\n" +
# "</body></html>\n"
It could have been a Ruby 1.8 issue, an issue with that version of Nokogiri, or something was wrong in your libXML... it's hard to say given the information in the question.

Related

How to download each zip file from a url and unpack using rails

Right now I have a URL which is populated with a list of .zip files in the browser. I am trying to use rails to download the files and then open them using Zip::File from the rubyzip gem. Currently I am doing this using the typhoeus gem:
response = Typhoeus.get("http://url_with_zip_files.com")
But the response.response_body is an HTML doc inside a string. I am new to programming so a hint in the right direction using best practices would help a lot.
response.response_body => "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<html>\n <head>\n <title>Index of /mainstream/posts</title>\n </head>\n <body>\n<h1>Index of /mainstream/posts</h1>\n<table><tr><th>Name</th><th>Last modified</th><th>Size</th><th>Description</th></tr><tr><th colspan=\"4\"><hr></th></tr>\n<tr><td>Parent Directory</td><td> </td><td align=\"right\"> - </td><td> </td></tr>\n<tr><td>1476536091739.zip</td><td align=\"right\">15-Oct-2016 16:01 </td><td align=\"right\"> 10M</td><td> </td></tr>\n<tr><td>1476536487496.zip</td><td align=\"right\">15-Oct-2016 16:04 </td><td align=\"right\"> 10M</td><td> </td></tr>"
To break this down you need to:
Get the initial HTML index page with Typhoeus
base_url = "http://url_with_zip_files.com/"
response = Typhoeus.get(base_url)
Then use Nokogiri to parse that HTML and extract all the links to the zip files (see: extract links (URLs), with nokogiri in ruby, from a href html tags?)
doc = Nokogiri::HTML(response.body)
links = doc.css('a').map { |link| link['href'] }
# The first link is a link to Parent Directory which you should probably drop
# looks like: "/5Rh5AMTrc4Pv/mainstream/"
links.shift
# base_url already ends with a slash, so append the file name directly
links = links.map { |link| base_url + link }
# Should look like:
# links = ["http://url_with_zip_files.com/1476536091739.zip", "http://url_with_zip_files.com/1476536487496.zip" ...]
Once you have all the links: you then visit all the extracted links to download the zip files with ruby and unzip them (see: Ruby: Download zip file and extract)
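require 'stringio'
require 'zip'
# parse_zip_content is a placeholder for whatever you do with each entry's contents
def download_and_parse(zip_file_link)
  input = Typhoeus.get(zip_file_link).body
  Zip::InputStream.open(StringIO.new(input)) do |io|
    while (entry = io.get_next_entry)
      puts entry.name
      parse_zip_content io.read
    end
  end
end
# Define the method first, then call it for every extracted link
links.each do |link|
  download_and_parse(link)
end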
If you want to use Typhoeus to stream the file contents from the URL into memory, see the Typhoeus documentation section titled "Streaming the response body". You can also use Typhoeus to download all of the files in parallel, which would improve performance.
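Building the absolute URLs by string concatenation is easy to get subtly wrong (double slashes, nil hrefs, the Parent Directory link). A minimal sketch of an alternative using only the standard library's URI.join is below. The helper name absolute_zip_links and the example.com base URL are assumptions made for illustration; the hrefs would come from the Nokogiri step above.

```ruby
require 'uri'

# Hypothetical helper: turn the href values scraped from an index page into
# absolute URLs, keeping only the ones that point at .zip files.
def absolute_zip_links(base_url, hrefs)
  hrefs
    .compact                                   # <a> tags without an href yield nil
    .select { |href| href.match?(/\.zip\z/i) } # keep only names ending in .zip
    .map { |href| URI.join(base_url, href).to_s }
end

# Placeholder base URL and hrefs shaped like the directory listing in the question
hrefs = ['/5Rh5AMTrc4Pv/mainstream/', '1476536091739.zip', nil, '1476536487496.zip']
puts absolute_zip_links('http://example.com/mainstream/posts/', hrefs)
```

Filtering on the .zip suffix drops the Parent Directory link without relying on its position in the list.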
I believe Nokogiri will be your best bet.
base_url = "http://url_with_zip_files.com/"
doc = Nokogiri::HTML(Typhoeus.get(base_url).body)
zip_array = []
doc.search('a').each do |link|
  href = link['href']
  # skip anchors without an href and anything that is not a .zip file
  next unless href && href.match?(/\.zip\z/i)
  zip_array << Typhoeus.get(base_url + href)
end

Rails mailer error with inline attachment

I am seeing very strange behavior in Action Mailer. I implemented a mailer action about 5 months ago and it was working, but yesterday, for no apparent reason, it started crashing.
The problem
I have a mail layout that I use in all my emails. In it I render an image that is attached beforehand by a before filter.
Layout = app/views/layouts/email.html.erb
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Visionamos</title>
<link rel="stylesheet" type="text/css" href=<%="/assets/email.css" %>>
</head>
<body>
<table>
<tr>
<td id="head">
<table>
<tr class="image">
<td><%= image_tag(attachments['visionamos.png'].url) %></td>
...
..
.
User Mailer = app/mailers/users.rb
class UsuariosMailer < ActionMailer::Base
  include AbstractController::Callbacks # include controller callbacks
  default :from => "monitoreo@visionamos.com"
  layout "mail" # Set email layout
  before_filter :add_inline_attachments! # Add image header for all actions
  def otp_password(user, otp_key)
    @usuario = user
    @code = otp_key
    email_with_name = "#{@usuario.nombre} <#{@usuario.email}>"
    mail(:to => email_with_name, :subject => "One time password, Plataforma Visionamos")
  end
  private
  def add_inline_attachments!
    attachments.inline['visionamos.png'] = File.read("#{Rails.root}/app/assets/images/visionamos.png")
  end
end
Now, when I try to send the email, I get this error:
NoMethodError - undefined method `match' for nil:NilClass:
mail (2.5.4) lib/mail/utilities.rb:112:in `unbracket'
mail (2.5.4) lib/mail/part.rb:29:in `cid'
mail (2.5.4) lib/mail/part.rb:33:in `url'
app/views/layouts/mail.html.erb:13:in `_app_views_layouts_mail_html_erb__573848672563180413_70191451095440'
<td><%= image_tag(attachments['visionamos.png'].url) %></td>
But the image is attached to the email
>> attachments['visionamos.png']
=> #<Mail::Part:70191451538040, Multipart: false, Headers: <Content-Type: image/png; filename="visionamos.png">, <Content-Transfer-Encoding: binary>, <Content-Disposition: inline; filename="visionamos.png">, <content-id: >>
My DevEnv
Mac with Mavericks
Ruby 2.0 + Rails 3.2.16
Plus
The email works on my Amazon EC2 instance and in my coworkers' environments (Ubuntu and Mac)
If I delete the image_tag call in the layout, the email is sent and the image is shown as a regular attachment, not inline
Update!!!
I've tried @Gene's solution, but even though the email is sent, the images are normal attachments, not inline. Looking deeper, I found this:
>> attachments.inline['visionamos.png'].header
=> #<Mail::Header:0x00000106cf6870 @errors=[], @charset=nil, @raw_source="", @fields=[#<Mail::Field:0x00000106cf60c8 @field=#<Mail::ContentTypeField:0x00000106cf5fd8 @charset=nil, @main_type="image", @sub_type="png", @parameters={"filename"=>"visionamos.png"}, @name="Content-Type", @length=nil, @tree=nil, @element=#<Mail::ContentTypeElement:0x00000106cf5d30 @main_type="image", @sub_type="png", @parameters=[{"filename"=>"visionamos.png"}]>, @value="image/png; filename=\"visionamos.png\"", @filename="visionamos.png">, @field_order_id=23>, #<Mail::Field:0x00000106d17390 @field=#<Mail::ContentTransferEncodingField:0x00000106d172a0 @charset=nil, @name="Content-Transfer-Encoding", @length=nil, @tree=nil, @element=#<Mail::ContentTransferEncodingElement:0x00000106d16ff8 @encoding="binary">, @value="binary">, @field_order_id=24>, #<Mail::Field:0x00000106d14a78 @field=#<Mail::ContentDispositionField:0x00000106d14960 @charset=nil, @name="Content-Disposition", @length=nil, @tree=nil, @element=#<Mail::ContentDispositionElement:0x00000106d145c8 @disposition_type="inline", @parameters=[{"filename"=>"visionamos.png"}]>, @value="inline; filename=\"visionamos.png\"", @parameters={"filename"=>"visionamos.png"}, @filename="visionamos.png">, @field_order_id=26>, #<Mail::Field:0x00000106d3e8f0 @field=#<Mail::UnstructuredField:0x00000106d5ef60 @errors=[["content-id", nil, #<Mail::Field::ParseError: Mail::MessageIdsElement can not parse |<52fe636fae8a6_686085098c087130@MacBook Pro de Ruben.mail>|
Reason was: Expected one of !, #, $, %, &, ', *, +, -, /, =, ?, ^, _, `, {, |, }, ~, @, ., ", > at line 1, column 40 (byte 40) after <52fe636fae8a6_686085098c087130@MacBook>]], @charset=#, @name="content-id", @length=nil, @tree=nil, @element=nil, @value="">, @field_order_id=100>]>
The interesting part is
#<Mail::Field::ParseError: Mail::MessageIdsElement can not parse |<52fe636fae8a6_686085098c087130@MacBook Pro de Ruben.mail>|
Reason was: Expected one of !, #, $, %, &, ', *, +, -, /, =, ?, ^, _, `, {, |, }, ~, @, ., ", > at line 1, column 40 (byte 40) after <52fe636fae8a6_686085098c087130@MacBook>]],
I looked at the mail source.
This error can only occur if the content id field is nil. However calling .url should be setting the content id to an empty string unless has_content_id? is returning true, meaning there's already a content id field in the multipart header.
This is not happening, so we must have a strange case where the header object is reporting has_content_id? true yet is returning a content_id of nil.
Try setting the content id field explicitly just after you set the graphic.
attachments['visionamos.png'].header['content-id'] = 'logo.graphic'
If this works, there's still the puzzle of why it's necessary. Did you make any other changes to mailer configuration or code? Did you upgrade any gems?
Addition responding to question edit
The header parser seems to be failing because there are spaces in the id ...@MacBook Pro de Ruben.mail. Try renaming the computer so the name has no spaces! I guess this constitutes a bug in mail. Spaces should be elided or replaced with a legal character.
My guess is that this will also fix the original problem, and you won't need to set the content-id manually any more. Hence another guess: you changed the machine name or moved development to a new machine, and that's when the bug appeared!
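To check whether a given machine is likely to be affected, you can look at the host name Ruby reports and see what it would look like with the illegal characters replaced. The sketch below is only an illustration of the "replace with a legal character" idea from this answer. The helper id_safe_hostname is hypothetical and is not part of the mail gem.

```ruby
require 'socket'

# Hypothetical helper, not part of the mail gem: make a host name usable as the
# right-hand side of a Message-ID / Content-ID by replacing every run of
# characters other than letters, digits, dots and hyphens with one hyphen.
def id_safe_hostname(hostname = Socket.gethostname)
  hostname.gsub(/[^A-Za-z0-9.\-]+/, '-')
end

puts id_safe_hostname('MacBook Pro de Ruben.mail') # the name from the parse error
puts id_safe_hostname                              # whatever this machine reports
```

If the two forms of your own host name differ, the name contains characters that the header parser may reject.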
sudo scutil --set HostName 'mymachine'
fixed this for me.
scutil --get HostName
was returning (not set)

Nokogiri parsing for metawords

I know this question has been asked earlier but I am not able to get the parsed result. I am trying to parse metawords using nokogiri, can any one point out my mistake?
keyword = []
meta_data = doc.xpath('//meta[@name="Keywords"]/@content') #parsing for keywords
meta_data.each do |meta|
keyword << meta.value
end
key_str=keyword.join(",")
I tried running this in irb as well, but keyword returns nil.
This is how I used it in irb
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML("www.google.com")
I have already tried alternatives from other Stack Overflow posts, like "Nokogiri html parsing question", but to no avail; they still return nil. I guess I am doing something wrong somewhere.
www.google.com does not have any meta keywords in the source. View Source on the page to see for yourself. So even if everything else went perfectly, you'd still get no results there.
The result of doc = Nokogiri::HTML("www.google.com") is
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>www.google.com</p></body></html>
If you want to fetch the contents of a URL, you want to use something like:
require 'open-uri'
doc = Nokogiri::HTML( URI.open('http://www.google.com') ) # on Ruby older than 2.5, use plain open(...)
If you get a valid HTML page, and use the proper casing on keywords to match the source, it works fine. Here's an example from my IRB session, fetching a page from one of the apps on my site that happens to use name="keywords" instead of name="Keywords":
irb(main):001:0> require 'open-uri'
#=> true
irb(main):002:0> require 'nokogiri'
#=> true
irb(main):003:0> url = "http://pentagonalrobin.phrogz.net/choose"
#=> "http://pentagonalrobin.phrogz.net/choose"
irb(main):004:0> doc = Nokogiri::HTML( open(url) ); nil # don't show doc here
#=> nil
irb(main):005:0> doc.xpath('//meta[@name="keywords"]/@content').map(&:value)
#=> ["team schedule free round-robin league"]
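Because attribute values are compared case-sensitively in XPath, a page that writes name="Keywords" will not match a query for "keywords", and vice versa. One way around this is to select all meta elements and compare the name in Ruby. The sketch below uses REXML instead of Nokogiri purely so that it runs without extra gems, and the markup is a made-up, well-formed sample; the same select/map idea applies to a Nokogiri node set.

```ruby
require 'rexml/document'

# Made-up, well-formed sample markup (REXML is an XML parser, so tags are closed)
xml = <<~XML
  <html>
    <head>
      <meta name="Keywords" content="ruby, nokogiri"/>
      <meta name="description" content="not a keyword list"/>
    </head>
  </html>
XML

doc = REXML::Document.new(xml)

# An exact-case query misses the tag because "Keywords" != "keywords"
exact_case = REXML::XPath.match(doc, '//meta[@name="keywords"]')

# Compare the name attribute case-insensitively in Ruby rather than in XPath
keywords = REXML::XPath.match(doc, '//meta')
                       .select { |meta| meta.attributes['name'].to_s.casecmp?('keywords') }
                       .map { |meta| meta.attributes['content'] }

puts exact_case.length
puts keywords.join(',')
```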

Mechanize - How to follow or "click" Meta refreshes in rails

I'm having a bit of trouble with Mechanize.
When I submit a form with Mechanize, I arrive at a page that contains only a meta refresh and no links.
My question is: how do I follow the meta refresh?
I have tried enabling meta refresh following, but then I get a socket error.
Sample code
require 'mechanize'
agent = WWW::Mechanize.new
agent.get("http://euroads.dk")
form = agent.page.forms.first
form.username = "username"
form.password = "password"
form.submit
page = agent.get("http://www.euroads.dk/system/index.php?showpage=login")
agent.page.body
The response:
<html>
<head>
<META HTTP-EQUIV=\"Refresh\" CONTENT=\"0;URL=index.php?showpage=m_frontpage\">
</head>
</html>
Then I try:
redirect_url = page.parser.at('META[HTTP-EQUIV=\"Refresh\"]')[
"0;URL=index.php?showpage=m_frontpage\"][/url=(.+)/, 1]
But I get:
NoMethodError: Undefined method '[]' for nil:NilClass
Internally, Mechanize uses Nokogiri to handle parsing of the HTML into a DOM. You can get at the Nokogiri document so you can use either XPath or CSS accessors to dig around in a returned page.
This is how to get the redirect URL with Nokogiri only:
require 'nokogiri'
html = <<EOT
<html>
<head>
<meta http-equiv="refresh" content="2;url=http://www.example.com/">
</meta>
</head>
<body>
foo
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
redirect_url = doc.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1]
redirect_url # => "http://www.example.com/"
doc.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1] breaks down to: Find the first occurrence (at) of the CSS accessor for the <meta> tag with an http-equiv attribute of refresh. Take the content attribute of that tag and return the string following url=.
This is some Mechanize code for a typical use. Because you gave no sample code to base mine on you'll have to work from this:
agent = Mechanize.new
page = agent.get('http://www.examples.com/')
redirect_url = page.parser.at('meta[http-equiv="refresh"]')['content'][/url=(.+)/, 1]
page = agent.get(redirect_url)
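Two details are worth watching when applying the pattern above to the page in the question: its content attribute spells the key as URL= in upper case, which a case-sensitive /url=(.+)/ will not match, and the target is relative (index.php?...), so it has to be resolved against the current page. A minimal sketch with only the standard library follows; the helper name refresh_target is an assumption made for illustration.

```ruby
require 'uri'

# Hypothetical helper: pull the target out of a meta-refresh content value and
# resolve it against the URL of the page that contained the tag.
def refresh_target(content, current_url)
  target = content[/url=(.+)/i, 1]&.strip # /i also matches "URL=" and "Url="
  return nil unless target

  URI.join(current_url, target).to_s
end

puts refresh_target('0;URL=index.php?showpage=m_frontpage',
                    'http://www.euroads.dk/system/index.php?showpage=login')
```

The returned string can then be passed to agent.get in place of redirect_url.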
EDIT: at('META[HTTP-EQUIV=\"Refresh\"]')
Your code has the above at(). Notice that you are escaping the double-quotes inside a single-quoted string. That results in a backslash followed by a double-quote in the string which is NOT what my sample uses, and is my first guess for why you're getting the error you are. Nokogiri can't find the tag because there is no <meta http-equiv=\"Refresh\"...>.
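The difference is easy to see in plain Ruby, without Nokogiri or Mechanize involved. In a single-quoted string only \\ and \' are treated as escapes, so \" stays as two characters:

```ruby
single = 'META[HTTP-EQUIV=\"Refresh\"]' # backslashes are kept literally
double = "META[HTTP-EQUIV=\"Refresh\"]" # \" collapses to a plain double quote

puts single
puts double
puts single.count('\\') # number of literal backslashes in the single-quoted selector
puts double.count('\\') # number of literal backslashes in the double-quoted selector
```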
EDIT: Mechanize has a built-in way to handle meta-refresh, by setting:
agent.follow_meta_refresh = true
It also has a method to parse the meta tag and return the content. From the docs:
parse(content, uri)
Parses the delay and url from the content attribute of a meta tag. Parse requires the uri of the current page to infer a url when no url is specified. If a block is given, the parsed delay and url will be passed to it for further processing.
Returns nil if the delay and url cannot be parsed.
# <meta http-equiv="refresh" content="5;url=http://example.com/" />
uri = URI.parse('http://current.com/')
Meta.parse("5;url=http://example.com/", uri) # => ['5', 'http://example.com/']
Meta.parse("5;url=", uri) # => ['5', 'http://current.com/']
Meta.parse("5", uri) # => ['5', 'http://current.com/']
Meta.parse("invalid content", uri) # => nil
Mechanize treats meta refresh elements just like links without text. Thus, your code can be as simple as this:
page = agent.get("http://www.euroads.dk/system/index.php?showpage=login")
page.meta_refresh.first.click

Importing XML file in Rails app, UTF-16 encoding problem

I'm trying to import an XML file via a web page in a Ruby on Rails application. The Ruby view code is as follows (I've removed the HTML layout tags to make the code easier to read):
<% form_for( :fmfile, :url => '/fmfiles', :html => { :method => :post, :name => 'Form_Import_DDR', :enctype => 'multipart/form-data' } ) do |f| %>
<%= f.file_field :document, :accept => 'text/xml', :name => 'fmfile_document' %>
<%= submit_tag 'Import DDR' %>
<% end %>
Results in the following HTML form
<form action="/fmfiles" enctype="multipart/form-data" method="post" name="Form_Import_DDR"><div style="margin:0;padding:0"><input name="authenticity_token" type="hidden" value="3da97372885564a4587774e7e31aaf77119aec62" />
<input accept="text/xml" id="fmfile_document" name="fmfile_document" size="30" type="file" />
<input name="commit" type="submit" value="Import DDR" />
</form>
The Form_Import_DDR method in the 'fmfiles_controller' is the code that does the hard work of reading the XML document in using REXML. The code is as follows
@fmfile = Fmfile.new
@fmfile.user_id = current_user.id
@fmfile.file_group_id = 1
@fmfile.name = params[:fmfile_document].original_filename
respond_to do |format|
if @fmfile.save
require 'rexml/document'
doc = REXML::Document.new(params[:fmfile_document].read)
doc.root.elements['File'].elements['BaseTableCatalog'].each_element('BaseTable') do |n|
@base_table = BaseTable.new
@base_table.base_table_create(@fmfile.user_id, @fmfile.id, n)
end
end
And it carries on reading all the different XML elements in.
I'm using Rails 2.1.0 and Mongrel 1.1.5 in Development environment on Mac OS X 10.5.4, site DB and browser on same machine.
My question is this: the whole process works fine when reading an XML document encoded as UTF-8 but fails when the XML file is UTF-16. Does anyone know why this is happening and how it can be stopped?
I have included the error output from the debugger console below. It takes about 5 minutes to get this output, and the browser times out with 'Failed to open page' before it appears.
Processing FmfilesController#create (for 127.0.0.1 at 2008-09-15 16:50:56) [POST]
Session ID: BAh7CDoMdXNlcl9pZGkGOgxjc3JmX2lkIiVmM2I3YWU2YWI4ODU2NjI0NDM2
NTFmMDE1OGY1OWQxNSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxh
c2g6OkZsYXNoSGFzaHsABjoKQHVzZWR7AA==--dd9f588a68ed628ab398dd1a967eedcd09e505e0
Parameters: {"commit"=>"Import DDR", "authenticity_token"=>"3da97372885564a4587774e7e31aaf77119aec62", "action"=>"create", "fmfile_document"=>#<File:/var/folders/LU/LU50A0vNHA07S4rxDAOk4E+++TI/-Tmp-/CGI.3001.1>, "controller"=>"fmfiles"}
User Load (0.000350) SELECT * FROM "users" WHERE (id = 1) LIMIT 1
Fmfile Create (0.000483) INSERT INTO "fmfiles" ("name", "file_group_id", "updated_at", "report_created_at", "report_link", "report_version", "option_on_open_account_name", "user_id", "option_default_custom_menu_set", "option_on_close_script", "path", "report_type", "option_on_open_layout", "option_on_open_script", "created_at") VALUES('TheTest_fp7 2.xml', 1, '2008-09-15 15:50:56', NULL, NULL, NULL, NULL, 1, NULL, NULL, NULL, NULL, NULL, NULL, '2008-09-15 15:50:56')
REXML::ParseException (#<Iconv::InvalidCharacter: "਼䙍偒数 (followed by a few thousand similar looking chinese characters)
䙍偒数潲琾", ["\n"]>
/Library/Ruby/Site/1.8/rexml/encodings/ICONV.rb:7:in `conv'
/Library/Ruby/Site/1.8/rexml/encodings/ICONV.rb:7:in `decode'
/Library/Ruby/Site/1.8/rexml/source.rb:50:in `encoding='
/Library/Ruby/Site/1.8/rexml/parsers/baseparser.rb:210:in `pull'
/Library/Ruby/Site/1.8/rexml/parsers/treeparser.rb:21:in `parse'
/Library/Ruby/Site/1.8/rexml/document.rb:190:in `build'
/Library/Ruby/Site/1.8/rexml/document.rb:45:in `initialize'
Rather than a rails/mongrel problem, it sounds more likely that there's an issue either with your XML file or with the way REXML handles it. You can check this by writing a short script to read your XML file directly (rather than within a request) and seeing if it still fails.
Assuming it does, there are a couple of things I'd look at. First, I'd check you are running the latest version of REXML. A couple of years ago there was a bug (http://www.germane-software.com/projects/rexml/ticket/63) in its UTF-16 handling.
The second thing I'd check is whether your issue is similar to this: http://groups.google.com/group/rubyonrails-talk/browse_thread/thread/ba7b0585c7a6330d. If so, you can try the workaround in that thread.
If none of the above helps, then please reply with more information, such as the exception you are getting when you try and read the file.
Since getting this to work requires me to only change the encoding attribute of the first XML element to have the value UTF-8 instead of UTF-16, the XML file is actually UTF-8 and labelled wrongly by the application that generates it.
The XML file is a FileMaker DDR export produced by FileMaker Pro Advanced 8.5 on OS X 10.5.4
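This diagnosis fits the exception text: the "Chinese" characters seem to be what you get when single-byte text is paired up and read as big-endian UTF-16. The sketch below first reproduces that effect for the ASCII string FMPReport> (an assumption about what the file contains, inferred from the characters in the error), then shows one way to flag a file whose declaration names an encoding other than UTF-8 although its bytes are valid UTF-8. Both helper names are hypothetical, and the sample document is made up.

```ruby
# ASCII text misread as big-endian UTF-16: every two bytes become one character
misread = 'FMPReport>'.b.force_encoding(Encoding::UTF_16BE).encode(Encoding::UTF_8)
puts misread # compare with the tail of the exception message

# Hypothetical helper: the encoding named in the XML declaration, or nil when
# the declaration cannot be read as single-byte text (e.g. real UTF-16 data)
def declared_encoding(bytes)
  bytes.b[/<\?xml[^>]*?encoding=["']([\w.-]+)["']/, 1]
end

# Hypothetical check: declared as something other than UTF-8, yet valid UTF-8
def mislabelled_utf8?(bytes)
  declared = declared_encoding(bytes)
  return false if declared.nil? || declared.casecmp?('UTF-8')

  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Made-up sample: UTF-8 bytes with a UTF-16 label, as described above
sample = %(<?xml version="1.0" encoding="UTF-16"?>\n<FMPReport>caf\u00e9</FMPReport>)
puts mislabelled_utf8?(sample)                    # label and bytes disagree
puts mislabelled_utf8?(sample.encode('UTF-16LE')) # genuinely UTF-16 data
```

When the check reports a mismatch, rewriting the declaration (as described above) or transcoding the file are both options; which is right depends on what the generating application actually wrote.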
Have you tried doing this using JRuby? I've heard Unicode strings are better supported in JRuby.
One other thing you can try is to use another XML parsing library, such as libxml or Hpricot.
REXML is one of the slowest Ruby XML libraries you can use and might not scale.
Actually, I think your problem may be related to the problem I just detailed in this post. If I were you, I'd open it up in TextPad in Binary mode and see if there are any Byte Order Marks before your XML starts.
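If you would rather check for a byte order mark from Ruby than from an editor, comparing the first few bytes of the file against the known marks is enough. This is a minimal sketch; detect_bom and the BOMS table are names made up for illustration.

```ruby
# Byte order marks that can precede an XML document. The 4-byte marks come
# first so that UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
BOMS = {
  "\x00\x00\xFE\xFF".b => 'UTF-32BE',
  "\xFF\xFE\x00\x00".b => 'UTF-32LE',
  "\xEF\xBB\xBF".b     => 'UTF-8',
  "\xFE\xFF".b         => 'UTF-16BE',
  "\xFF\xFE".b         => 'UTF-16LE'
}.freeze

# Returns the encoding name implied by a leading BOM, or nil if there is none
def detect_bom(bytes)
  head = bytes.b
  BOMS.each { |bom, name| return name if head.start_with?(bom) }
  nil
end

# Usage with the uploaded file would look like:
#   detect_bom(File.binread('TheTest_fp7 2.xml', 4))
puts detect_bom("\xFF\xFE<\x00?\x00".b).inspect
puts detect_bom('<?xml version="1.0"?>').inspect
```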
