Why does Nokogiri ignore everything after the first attribute because of backslashes?
I'm not really sure why it's doing this:
[12] pry(Template)> b
=> "<td style=\\\"color:#fff; padding:3px; font-size:11px; text-align:center;\\\">Home Improvement Agreement: Electrical Services & Standby Generators</td>"
[13] pry(Template)> Nokogiri::HTML.parse(b).to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><td style='\\\"color:#fff;' padding:3px font-size:11px text-align:center>Home Improvement Agreement: Electrical Services & Standby Generators</td></body></html>\n"
Notice how it produced bad HTML, as in everything after the color attribute in the <td> element. It closed out the attribute, and assigned the rest of the variables as HTML name tags I guess.
I'm curious if anyone knows why Nokogiri would do this, and what I can do to circumvent it?
You are asking it to parse this:
<td style=\"color:#fff; ...\">
which is not valid. This is would be valid:
<td style="color:#fff; ...">
Try:
'<td style="color:#fff; padding:3px; font-size:11px; text-align:center;">Home Improvement Agreement: Electrical Services & Standby Generators</td>'
Nokogiri makes it easy to tell whether there is a problem parsing the HTML or XML document:
require 'nokogiri'
html = '<td style=\"color:#fff; padding:3px; font-size:11px; text-align:center;\">Home Improvement Agreement: Electrical Services & Standby Generators</td>'
doc = Nokogiri::HTML.parse(html)
doc.errors
=> [#<Nokogiri::XML::SyntaxError: error parsing attribute name>, #<Nokogiri::XML::SyntaxError: error parsing attribute name>, #<Nokogiri::XML::SyntaxError: error parsing attribute name>, #<Nokogiri::XML::SyntaxError: htmlParseEntityRef: no name>]
Related
I am trying to scrape a site but the results returned for just the links is different from when I inspect it with the browser.
In my browser I get normal links but all the a HREF links all become javascript:void(0); from Nokogiri.
Here is the site:
https://www.ctgoodjobs.hk/jobs/part-time
Here is my code:
url = "https://www.ctgoodjobs.hk/jobs/part-time"
response = open(url) rescue nil
next unless response
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').text
is not that easy, urls are "obscured" using a js function, that's why you're getting javascript: void(0) when asking for the hrefs... looking at the html, there are some hidden inputs for each link, and, there is a preview url that you can use to build the job preview url (if that's what you're looking for), so you have this:
<div class="result-list-job current-view">
<input type="hidden" name="job_id" value="04375145">
<input type="hidden" name="each_job_title_url" value="barista-senior-barista-咖啡調配員">
<h2 class="job-title">Barista/ Senior Barista 咖 啡 調 配 員</h2>
<h3 class="job-company">PACIFIC COFFEE CO. LTD.</h3>
<div class="job-description">
<ul class="job-desc-list clearfix">
<li class="job-desc-loc job-desc-small-icon">-</li>
<li class="job-desc-work-exp">0-1 yr(s)</li>
<li class="job-desc-salary job-desc-small-icon">-</li>
<li class="job-desc-post-date">09/11/16</li>
</ul>
</div>
<a class="job-save-btn" title="save this job" style="display: inline;"> </a>
<div class="job-batch-apply"><span class="checkbox" style="background-position: 0px 0px;"></span><input type="checkbox" class="styled" name="job_checkbox" value="04375145"></div>
<div class="job-cat job-cat-de"></div>
</div>
then, you can retrieve each job_id from those inputs, like:
inputs = doc.search('//input[#name="job_id"]')
and then build the urls (i found the base url at joblist_preview.js:
urls = inputs.map do |input|
"https://www.ctgoodjobs.hk/english/jobdetails/details.asp?m_jobid=#{input['value']}&joblistmode=previewlist&ga_channel=ct"
end
Take the output of a browser and that of a tool like wget, curl or nokogiri and you will find the HTML the browser presents can differ drastically from the raw HTML.
Browsers these days can process DHTML, Nokogiri doesn't. You can only retrieve the raw HTML using something that lets you see the content without the browser, like the above mentioned tools, then compare that with what you see in a text editor, or what nokogiri shows you. Don't trust the browser - they're known to lie because they want to make you happy.
Here's a quick glimpse into what the raw HTML contains, generated using:
$ nokogiri "https://www.ctgoodjobs.hk/jobs/part-time"
Nokogiri dropped me into IRB:
Your document is stored in #doc...
Welcome to NOKOGIRI. You are using ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin15]. Have fun ;)
Counting the hits found by the selector returns:
>> #doc.search('.job-title > a').size
30
Displaying the text found shows:
>> #doc.search('.job-title > a').map(&:text)
[
[ 0] "嬰 兒 奶 粉 沖 調 機 - 兼 職 產 品 推 廣 員 Part Time Promoter (時 薪 高 達 HK$90, 另 設 銷 售 佣 金 )",
...
[29] "Customer Services Representative (Part-time)"
]
Looking at the actual href:
>> #doc.search('.job-title > a').map{ |n| n['href'] }
[
[ 0] "javascript:void(0);",
...
[29] "javascript:void(0);"
]
You can tell the HTML doesn't contain anything but what Nokogiri is telling you, so the browser is post-processing the HTML, processing the DHTML and modifying the page you see if you use something to look at the HTML. So, the short fix is, don't trust the browser if you want to know what the server sends to you.
This is why scraping isn't very reliable and you should use an API if at all possible. If you can't, then you're going to have to roll up your sleeves and dig into the JavaScript and manually interpret what it's doing, then retrieve the data and parse it into something useful.
Your code can be cleaned up and simplified. I'd write it much more simply as:
url = "https://www.ctgoodjobs.hk/jobs/part-time"
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').map(&:text)
The use of search(...).text is a big mistake. text, when applied to a NodeSet, will concatenate the text of each contained node, making it extremely difficult to retrieve the individual text. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]
The first result foobar would require being split apart to be useful, and unless you have special knowledge of the content, trying to figure out how to do it will be a major pain.
Instead, use map to iterate through the elements and apply &:text to each one, returning an array of each element's text.
See "How to avoid joining all text from Nodes when scraping" and "Taking apart a DHTML page" also.
i am getting problem while reading xml files through nsxmlparser in ios,
<PRODUCTS>
<PRODUCTSLIST>
<PRODUCTDETAILS>
<headertext> test header </headertext>
<description><b style="font-size: x-small;">product, advantages</b></description>
</PRODUCTDETAILS>
</PRODUCTSLIST>
</PRODUCTS>
while i read the file using nsxmlparser i am able to get value(test header) for headertext but the description attribute value contains html tags so i cant able to get the result (<b style="font-size: x-small;">product, advantages</b>)i am getting result as empty
How can i get the result as((<b style="font-size: x-small;">product, advantages</b>)) for description attribute?
Speaking from a developers perspective I would not recommend using NSXMLParser due to it's laborious way to parse XML Files. There is a great write up about choosing the right XML Parser.
I use KissXML quite often.
You can find a quit tutorial of using it here.
Hope this helps.
Your problem is that the "b" tag is considered part of the XML structure, try escaping the '<' and '>' characters of the 'b' tag:
#"<b style=\"font-size: x-small;>product, advantages</b>"
see here
Is there a simple way to print an unformated xml string to screen in a ruby on rails application? Something like a xml beautifier?
Ruby core REXML::Document has pretty printing:
REXML::Document#write( output=$stdout, indent=-1, transitive=false, ie_hack=false )
indent: An integer. If -1, no
indenting will be used; otherwise, the
indentation will be twice this number
of spaces, and children will be
indented an additional amount. For a
value of 3, every item will be
indented 3 more levels, or 6 more
spaces (2 * 3). Defaults to -1
An example:
require "rexml/document"
doc = REXML::Document.new "<a><b><c>TExt</c><d /></b><b><d/></b></a>"
out = ""
doc.write(out, 1)
puts out
Produces:
<a>
<b>
<c>
TExt
</c>
<d/>
</b>
<b>
<d/>
</b>
</a>
EDIT: Rails has already REXML loaded, so you only have to produce new document and then write the pretty printed XML to some string which then can be embedded in a <pre> tag.
What about the Nokogiri gem? Here is an example use.
I am using the following regex
html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s))
to match the name [ Burkhart, Peterson & Company ] in this
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>
Generally parsing (X)HTML using Regular Expressions is bad practice. Ruby has the fantastic Nokogiri Library which uses libxml2 for parsing XHTML efficiently.
Which that being said, your . does not match newlines. Use the m modifier for your regexp which tells the . to match new lines. Or the Regexp::MULTILINE constant. Documented here
Your regular expression is also capturing the HTML before the text you require.
Using nokogiri and XPath would mean you could grab the content of this table cell by referring to its CSS class. Like this:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri::HTML DATA.read
p doc.at("td[#class='generalinfo_right']").text
__END__
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>
Which will return "Burkhart, Peterson & Company"
/m makes the dot match newlines
You'll want to use /m for multiline mode:
str.scan(/Name:</td>(.*?)</td>/m)
html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s)) doesn't match the new line characters; even if it would match those characters, the (.*?) part would grab everything after </td>, including <td class="generalinfo_right">.
To make the regular expression more generic, and allow to match the exact text you want, you should change the code to
html.scan(Regexp.new(/Name:<\/td><td[^>]*>(.*?)<\/td>/s))
The regular expression could be better written, though.
I would also not suggest to parse HTML/XHTML content with regular expression.
You can verify that all the answers suggesting you add /m or Regexp::MULTILINE are correct by going to rubular.com.
I also verified the solution in console, and also modifed the regex so that it would return only the name instead of all the extra junk.
Loading development environment (Rails 2.3.8)
ree-1.8.7-2010.02 > html = '<td class="generalinfo_left" align="right">Name:</td>
ree-1.8.7-2010.02'> <td class="generalinfo_right">Burkhart, Peterson & Company</td>
ree-1.8.7-2010.02'> '
=> "<td class="generalinfo_left" align="right">Name:</td>\n<td class="generalinfo_right">Burkhart, Peterson & Company</td>\n"
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/m))
=> [["\n<td class="generalinfo_right">Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>.*<td[^>]*>(.*?)<\/td>/m))
=> [["Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 >
I've been working on a piece of software where I need to generate a custom XML file to send back to a client application. The current solutions on Ruby/Rails world for generating XML files are slow, at best. Using builder or event Nokogiri, while have a nice syntax and are maintainable solutions, they consume too much time and processing.
I definetly could go to ERB, which provides a good speed at the expense of building the whole XML by hand.
HAML is a great tool, have a nice and straight-forward syntax and is fairly fast. But I'm struggling to build pure XML files using it. Which makes me wonder, is it possible at all?
Does any one have some pointers to some code or docs showing how to do this, build a full, valid XML from HAML?
Doing XML in HAML is easy, just start off your template with:
!!! XML
which produces
<?xml version='1.0' encoding='utf-8' ?>
Then as #beanish said earlier, you "make up your own tags":
%test
%test2 hello
%item{:name => "blah"}
to get
<test>
<test2>hello</test2>
<item name='blah'></item>
</test>
More:
http://haml.info/docs/yardoc/file.REFERENCE.html#doctype_
%test
%test2 hello
%item{:name => "blah"}
run it through haml
haml hamltest.haml test.xml
open the file in a browser
<test>
<test2>hello</test2>
<item name='blah'></item>
</test>
The HAML reference talks about html tags and gives some examples.
HAML reference
This demonstrates some things that could use useful for xml documents:
!!! XML
%root{'xmlns:foo' => 'http://myns'}
-# Note: :dashed-attr is invalid syntax
%dashed-tag{'dashed-attr' => 'value'} Text
%underscore_tag Text
- ['apple', 'orange', 'pear'].each do |fruit|
- haml_tag(fruit, "Yummy #{fruit.capitalize}!", 'fruit-code' => fruit.upcase)
%foo:nstag{'foo:nsattr' => 'value'}
Output:
<?xml version='1.0' encoding='utf-8' ?>
<root xmlns:foo='http://myns'>
<dashed-tag dashed-attr='value'>Text</dashed-tag>
<underscore_tag>Text</underscore_tag>
<apple fruit-code='APPLE'>Yummy Apple!</apple>
<orange fruit-code='ORANGE'>Yummy Orange!</orange>
<pear fruit-code='PEAR'>Yummy Pear!</pear>
<foo:nstag foo:nsattr='value'></foo:nstag>
</root>
Look at the Haml::Helpers link on the haml reference for more methods like haml_tag.
If you want to use double-quotes for attributes,
See: https://stackoverflow.com/a/967065/498594
Or outside of rails use:
>> Haml::Engine.new("%tag{:name => 'value'}", :attr_wrapper => '"').to_html
=> "<tag name=\"value\"></tag>\n"
Haml can produce XML just as easily as HTML (I've used it for FBML and XHTML). What problems are you having?
I've not used HAML, but if you can't make it work another option is Builder.
what about creating the xml header, e.g. <?xml version="1.0" encoding="UTF-8"?>
?
It should be possible. After all you can create plain old XML with Notepad.